Releases: b4rtaz/distributed-llama
0.10.3
0.10.2
This version introduces a new CLI argument: `--max-seq-len <n>`. It allows you to reduce the context size and, at the same time, memory consumption. The argument works with the following commands: `dllama inference`, `dllama chat`, and `dllama-api`. You don't need to set it on the worker; the root node distributes this setting to the workers.
Example:
./dllama chat --model ... --nthreads 8 --max-seq-len 1024
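Because the KV cache grows linearly with the sequence length, lowering `--max-seq-len` shrinks it proportionally. A rough estimate, assuming the standard Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and an F32 cache:

```python
# Rough KV cache size estimate; the model shape below is the standard
# Llama 3.1 8B configuration, used here only for illustration.
layers, kv_heads, head_dim, f32_bytes = 32, 8, 128, 4

def kv_cache_bytes(seq_len):
    # Factor 2 accounts for both the K and the V tensors.
    return 2 * layers * seq_len * kv_heads * head_dim * f32_bytes

print(kv_cache_bytes(131072) / 1e9)  # full context: ~34.4 GB
print(kv_cache_bytes(1024) / 1e9)    # --max-seq-len 1024: ~0.27 GB
```

So a 1024-token limit needs roughly 1/128 of the full-context cache.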
0.10.1
0.10.0
This version introduces support for the Llama 3.1 model! 🔥 Additionally, it includes a small improvement that enables you to run the Llama 3.1 8B Q40 on a standard computer with the full context size (131,072 tokens!).
Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context
The Llama 3.1 8B model quantized to the Q40 format requires 6.3 GB of RAM. The key-value cache for the full context requires approximately 34 GB of memory (F32). For casual devices, this is definitely too much. That's why this version introduces the `--kv-cache-storage disc` argument (Windows is not supported yet). Once set, the key-value cache is stored on your disk. If you have a fast SSD, the slowdown should be acceptable. The argument works for the `dllama inference`, `dllama worker`, and `dllama-api` commands. An important fact is that the KV cache is split across all nodes in the cluster, so, for example, with 4 nodes, each node needs ~8.5 GB of memory (RAM or disk) to hold its share.
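The ~34 GB and ~8.5 GB figures above can be reproduced from the standard Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128; these dimensions are an assumption for illustration, not read from the model file):

```python
# Approximate full-context KV cache size for Llama 3.1 8B, stored as F32.
layers, kv_heads, head_dim = 32, 8, 128
seq_len = 131072            # full Llama 3.1 context
bytes_per_value = 4         # F32
total = 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value  # K and V
print(total / 1e9)          # ~34.4 GB in total

nodes = 4
print(total / nodes / 1e9)  # ~8.6 GB per node with 4 nodes
```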
How to run Llama 3.1 8B
- Download the Distributed Llama repository and compile it: `make dllama && make dllama-api`
- Download the model: `python launch.py llama3_1_8b_instruct_q40`
- Run the model:
./dllama chat --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999
or
./dllama-api --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999
If your worker node does not have enough RAM for the KV cache, you can run the worker with the `--kv-cache-storage disc` argument:
./dllama worker --port 9999 --kv-cache-storage disc --nthreads 8
TODO
A future version will include the ability to reduce the context size. This should reduce memory consumption when the full context is not needed.
The 0.10.2 version introduced the `--max-seq-len <n>` argument.
0.9.2
This version allows you to override the chat template. This may be helpful if a model's tokenizer does not include a chat template.
How to use:
./dllama ... --chat-template llama3
./dllama-api ... --chat-template llama3
Supported values:
llama2
llama3
zephyr
chatml
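As an illustration of what these values select, here is a minimal sketch of how a `llama3`-style template renders a conversation. This mirrors the public Llama 3 prompt format; it is not the project's actual implementation, which embeds the template in the tokenizer file:

```python
# Sketch of a Llama 3 style chat template (illustrative only).
# Each message is wrapped in header tokens and terminated with <|eot_id|>,
# and the prompt ends with an open assistant header for the model to complete.
def render_llama3(messages):
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = render_llama3([{"role": "user", "content": "Hello!"}])
print(prompt)
```

The other values (`llama2`, `zephyr`, `chatml`) wrap messages in their respective formats in the same spirit.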
0.9.1
0.9.0
This version introduces breaking changes to the tokenizer format. The tokenizer now contains the whole chat template in the Hugging Face format.
Breaking changes
You need to regenerate your tokenizer. The tokenizer from the 0.8.0 version won't work with the 0.9.0 version.
0.8.0
This version introduces a new tokenizer format that includes configuration for chat functionality. With this update, Distributed Llama can support various models that operate in chat mode. The new tokenizer is required for these modes:
dllama chat ...
dllama-api ...
Breaking changes
The above change requires the tokenizer file to be regenerated. For Llama 3, you need to rerun the `convert-tokenizer-llama3.py` script. For other models, the process is a bit more complicated; please refer to this post for detailed instructions.