Releases: b4rtaz/distributed-llama
0.10.3
0.10.2
This version introduces a new CLI argument: `--max-seq-len <n>`. It allows you to reduce the context size and, at the same time, memory consumption. The argument works with the following commands: `dllama inference`, `dllama chat`, and `dllama-api`. You don't need to set it on the worker; the root node distributes this setting to the workers.
Example:
./dllama chat --model ... --nthreads 8 --max-seq-len 1024
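Because the KV cache grows linearly with the sequence length, lowering `--max-seq-len` shrinks it proportionally. A rough estimate, assuming the standard Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and an F32 cache:

```python
# Rough KV cache size estimate; the model shape below is the standard
# Llama 3.1 8B configuration, used here only for illustration.
layers, kv_heads, head_dim, f32_bytes = 32, 8, 128, 4

def kv_cache_bytes(seq_len):
    # Factor 2 accounts for both the K and the V tensors.
    return 2 * layers * seq_len * kv_heads * head_dim * f32_bytes

print(kv_cache_bytes(131072) / 1e9)  # full context: ~34.4 GB
print(kv_cache_bytes(1024) / 1e9)    # --max-seq-len 1024: ~0.27 GB
```

So a 1024-token limit needs roughly 1/128 of the full-context cache.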
0.10.1
0.10.0
This version introduces support for the Llama 3.1 model! 🔥 Additionally, it includes a small improvement that enables you to run the Llama 3.1 8B Q40 on a standard computer with the full context size (131,072 tokens!).
Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context
The Llama 3.1 8B model quantized to the Q40 format requires 6.3 GB of RAM. The key-value cache for the full context requires approximately 34 GB of memory (F32). For casual devices, this is definitely too much. That's why this version introduces the `--kv-cache-storage disc` argument (Windows is not supported yet). Once set, the key-value cache is stored on your disk. If you have a fast SSD, the slowdown should be acceptable. The argument works for the `dllama inference`, `dllama worker`, and `dllama-api` commands. An important fact is that the KV cache is split across all nodes in the cluster, so, for example, with 4 nodes, each node needs ~8.5 GB of memory (RAM or disk) to hold its share.
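The ~34 GB and ~8.5 GB figures above can be reproduced from the standard Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128; these dimensions are an assumption for illustration, not read from the model file):

```python
# Approximate full-context KV cache size for Llama 3.1 8B, stored as F32.
layers, kv_heads, head_dim = 32, 8, 128
seq_len = 131072            # full Llama 3.1 context
bytes_per_value = 4         # F32
total = 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value  # K and V
print(total / 1e9)          # ~34.4 GB in total

nodes = 4
print(total / nodes / 1e9)  # ~8.6 GB per node with 4 nodes
```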
How to run Llama 3.1 8B
- Download the Distributed Llama repository and compile it: `make dllama && make dllama-api`
- Download the model: `python launch.py llama3_1_8b_instruct_q40`
- Run the model:
./dllama chat --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999
or
./dllama-api --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999
If your worker node does not have enough RAM for the KV cache, you can run the worker with the `--kv-cache-storage disc` argument:
./dllama worker --port 9999 --kv-cache-storage disc --nthreads 8
TODO
A future version will include the ability to reduce the context size. This should reduce memory consumption when the full context is not needed.
The 0.10.2 version introduced the `--max-seq-len <n>` argument.
0.9.2
This version allows you to override the chat template. This may be helpful if a model's tokenizer does not include a chat template.
How to use:
./dllama ... --chat-template llama3
./dllama-api ... --chat-template llama3
Supported values:
llama2
llama3
zephyr
chatml
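As an illustration of what these values select, here is a minimal sketch of how a `llama3`-style template renders a conversation. This mirrors the public Llama 3 prompt format; it is not the project's actual implementation, which embeds the template in the tokenizer file:

```python
# Sketch of a Llama 3 style chat template (illustrative only).
# Each message is wrapped in header tokens and terminated with <|eot_id|>,
# and the prompt ends with an open assistant header for the model to complete.
def render_llama3(messages):
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

prompt = render_llama3([{"role": "user", "content": "Hello!"}])
print(prompt)
```

The other values (`llama2`, `zephyr`, `chatml`) wrap messages in their respective formats in the same spirit.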
0.9.1
0.9.0
This version introduces breaking changes to the tokenizer format. The tokenizer now contains the whole chat template in the Hugging Face format.
Breaking changes
You need to regenerate your tokenizer. The tokenizer from the 0.8.0 version won't work with the 0.9.0 version.
0.8.0
This version introduces a new tokenizer format that includes configuration for chat functionality. With this update, Distributed Llama can support various models that operate in chat mode. The new tokenizer is required for these modes:
dllama chat ...
dllama-api ...
Breaking changes
The above change requires the tokenizer file to be regenerated. For Llama 3, you need to rerun the `convert-tokenizer-llama3.py` script. For other models, the process is a bit more complicated; please refer to this post for detailed instructions.