
Releases: b4rtaz/distributed-llama

0.10.3

10 Aug 22:27
3353d56

This version refactors the code to reduce the use of the writeMany and readMany methods.

0.10.2

29 Jul 12:23
71135e6

This version introduces a new CLI argument: --max-seq-len <n>. It allows you to reduce the context size and, with it, memory consumption. The argument works with the following commands: dllama inference, dllama chat, and dllama-api. You don't need to set it on the workers; the root node distributes this setting to them.

Example:

./dllama chat --model ... --nthreads 8 --max-seq-len 1024
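
The same flag works with dllama-api, for example:

./dllama-api --model ... --tokenizer ... --nthreads 8 --max-seq-len 1024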

0.10.1

28 Jul 14:29

This version implements a fallback for the matmulQ40vQ80 operation. Distributed Llama now supports all CPU architectures, with optimizations specifically for ARM and AVX2 CPUs.
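
The idea behind such a fallback is plain scalar code: multiply matching quantization blocks and accumulate the result in float. Below is a minimal Python sketch of the arithmetic only (the actual fallback is C++); it assumes the llama.cpp-style block layout for Q40 and Q80 (32 weights per block, an fp16 scale per block, 4-bit weights packed two per byte), and the function name is hypothetical:

import struct

BLOCK_SIZE = 32  # weights per quantization block (assumed)

def dot_q40_q80(q40_row: bytes, q80_vec: bytes, n: int) -> float:
    # Dot product of one Q40-quantized weight row with a Q80-quantized vector.
    acc = 0.0
    for b in range(n // BLOCK_SIZE):
        o40 = b * 18  # assumed Q40 block: 2-byte fp16 scale + 16 bytes of packed nibbles
        d40 = struct.unpack_from('<e', q40_row, o40)[0]
        nibbles = q40_row[o40 + 2:o40 + 18]
        o80 = b * 34  # assumed Q80 block: 2-byte fp16 scale + 32 signed 8-bit values
        d80 = struct.unpack_from('<e', q80_vec, o80)[0]
        q8 = struct.unpack_from('<32b', q80_vec, o80 + 2)
        s = 0
        for i in range(16):
            s += ((nibbles[i] & 0x0F) - 8) * q8[i]       # low nibble -> element i
            s += ((nibbles[i] >> 4) - 8) * q8[i + 16]    # high nibble -> element i + 16
        acc += d40 * d80 * s
    return acc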

0.10.0

25 Jul 11:54
4b8a0ca

This version introduces support for the Llama 3.1 model! 🔥 Additionally, it includes a small improvement that enables you to run the Llama 3.1 8B Q40 on a standard computer with the full context size (131,072 tokens!).

[Image: Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context]

The Llama 3.1 8B model quantized to the Q40 format requires 6.3 GB of RAM. The key-value cache for the full context requires approximately 34 GB of memory (F32), which is far too much for consumer devices. That's why this version introduces the --kv-cache-storage disc argument (Windows is not supported yet). When set, the key-value cache is stored on disk; if you have a fast SSD, the slowdown should be acceptable. This argument works with the dllama inference, dllama worker, and dllama-api commands. Importantly, the KV cache is split across all nodes in the cluster, so with 4 nodes, for example, each node needs about 8.5 GB of memory (RAM or disk) for its share of the cache.
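
To see where these numbers come from, here is the rough arithmetic, assuming the published Llama 3.1 8B configuration (32 transformer layers, 8 KV heads, head dimension 128):

n_layers, n_kv_heads, head_dim = 32, 8, 128
seq_len, bytes_per_value = 131_072, 4  # full context, F32
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value  # keys + values
print(f"total: {kv_bytes / 1e9:.1f} GB")                      # ~34.4 GB
print(f"per node with 4 nodes: {kv_bytes / 4 / 1e9:.1f} GB")  # ~8.6 GB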

How to run Llama 3.1 8B

  1. Download the Distributed Llama repository and compile it: make dllama && make dllama-api.
  2. Download the model: python launch.py llama3_1_8b_instruct_q40
  3. Run the model:
    • ./dllama chat --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999 or
    • ./dllama-api --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999
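
Once the dllama-api server from step 3 is running, you can send requests to it over HTTP. A minimal client sketch, assuming the server exposes an OpenAI-compatible /v1/chat/completions endpoint; the host, port, and request fields below are placeholders, so check the server's startup output for the actual address:

import json, urllib.request

url = "http://127.0.0.1:9990/v1/chat/completions"  # placeholder host/port
payload = {"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}
req = urllib.request.Request(url, data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["choices"][0]["message"]["content"])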

If your worker node does not have enough RAM for the KV cache, you can run the worker with the --kv-cache-storage disc argument.

./dllama worker --port 9999 --kv-cache-storage disc --nthreads 8

TODO

A future version will include the ability to reduce the context size. This should reduce memory consumption when the full context is not needed.

Update: version 0.10.2 introduced the --max-seq-len <n> argument.

0.9.2

12 Jul 20:10
90d7ebd

This version allows you to override the chat template. This may be helpful if a model's tokenizer does not include a chat template.

How to use:

./dllama ... --chat-template llama3
./dllama-api ... --chat-template llama3

Supported values:

  • llama2
  • llama3
  • zephyr
  • chatml
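
For reference, the llama3 value corresponds to Meta's published Llama 3 prompt format. A rough Python illustration of how such a template lays out a conversation (Distributed Llama applies an equivalent template internally; this helper is only illustrative):

def render_llama3(messages):
    # Each message is wrapped in header/end-of-turn markers.
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    # A trailing assistant header asks the model to generate the next reply.
    return out + "<|start_header_id|>assistant<|end_header_id|>\n\n"

print(render_llama3([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]))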

0.9.1

01 Jun 14:22
08b4bcf

The --weights-float-type argument is now optional for models converted with the converter from version 0.9.1 or above.

0.9.0

01 Jun 12:18
a5e0445

This version introduces breaking changes to the tokenizer format. The tokenizer now contains the whole chat template in the Hugging Face format.
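
The chat template here is the Jinja template that Hugging Face tokenizers ship in tokenizer_config.json. A short sketch of how that template is normally consumed (requires the transformers package and access to the referenced model; the model ID is only an example):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tok.chat_template)  # the raw Jinja chat template, now also embedded in the Distributed Llama tokenizer file
print(tok.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    tokenize=False,
    add_generation_prompt=True,
))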

Breaking changes

You need to regenerate your tokenizer. The tokenizer from the 0.8.0 version won't work with the 0.9.0 version.

0.8.0

31 May 20:49
6eccd30

This version introduces a new tokenizer format that includes configuration for chat functionality. With this update, Distributed Llama can support various models that operate in chat mode. The new tokenizer is required for these modes:

  • dllama chat ...
  • dllama-api ...

Breaking changes

The above change requires the tokenizer file to be regenerated. For Llama 3, you need to rerun the convert-tokenizer-llama3.py script. For other models, the process is a bit more complicated; please refer to this post for detailed instructions.

0.7.4

29 May 22:15
dc997b4

dllama-api:

  • Resolved a problem with adding unwanted <|eot_id|> to the response.
  • Introduced a naive cache that speeds up inference in chat clients like AnythingLLM (demo).
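
The naive cache presumably works on the prompt prefix: chat clients resend the whole conversation with every request, so if a new prompt starts with the tokens of the previous one, only the new suffix has to be evaluated. A minimal sketch of that idea (an assumption about the mechanism, not the actual implementation):

def tokens_to_evaluate(prev_tokens, new_tokens):
    # Length of the shared prefix between the cached and the new prompt.
    common = 0
    while (common < len(prev_tokens) and common < len(new_tokens)
           and prev_tokens[common] == new_tokens[common]):
        common += 1
    # Only the tokens after the shared prefix must be pushed through the model.
    return new_tokens[common:]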

0.7.3

27 May 21:15

This version adds Windows support. 🎉🎉🎉 Thanks @DifferentialityDevelopment!

Additionally, this version introduces a limit on the number of nodes: nSlices <= nKvHeads. For example, a model with 8 KV heads can be split across at most 8 nodes. More details here.