Skip to content

0.10.0

Compare
Choose a tag to compare
@b4rtaz b4rtaz released this 25 Jul 11:54
· 11 commits to main since this release
4b8a0ca

This version introduces support for the Llama 3.1 model! 🔥 Additionally, it includes a small improvement that enables you to run the Llama 3.1 8B Q40 on a standard computer with the full context size (131,072 tokens!).

Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context
Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context

The quantized Llama 3.1 8B model to Q40 format requires 6.3 GB GB of RAM. The key-value cache for the full context requires approximately 34 GB of memory (F32). For casual devices, this is definitely too high. That's why this version introduces the --kv-cache-storage disc argument (Windows is not supported yet). Once set, the key-value cache will be stored on your disk. If you have a fast SSD, the slowdown should be acceptable. This argument works for the dllama inference, dllama worker, and dllama-api commands. An important fact is that the size of the KV cache is split across all nodes in the cluster. So, for example, with 4 nodes, each needs to have ~8.5 GB of memory (RAM or disk) to keep the KV cache.

How to run Llama 3.1 8B

  1. Download Distributed Llama repository and compile it: make dllama && make dllama-api.
  2. Download model python launch.py llama3_1_8b_instruct_q40
  3. Run model:
    • ./dllama chat --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999 or
    • ./dllama-api --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999

If your worker node does not have enough RAM for the KV cache, you can run the worker with the --kv-cache-storage disc argument.

./dllama worker --port 9999 --kv-cache-storage disc --nthreads 8

TODO

A future version will include the ability to reduce the context size. This should reduce memory consumption when the full context is not needed.

The 0.10.2 version introduced the --max-seq-len <n> argument.