Merge 'origin/master' into hipblas
SlyEcho committed Apr 28, 2023
2 parents a1caa48 + 0b2da20 commit 3b4a531
Showing 10 changed files with 862 additions and 56 deletions.
2 changes: 1 addition & 1 deletion .devops/tools.sh
@@ -23,7 +23,7 @@ elif [[ $arg1 == '--all-in-one' || $arg1 == '-a' ]]; then
echo "Skip model quantization, it already exists: ${i/f16/q4_0}"
else
echo "Converting PTH to GGML: $i into ${i/f16/q4_0}..."
./quantize "$i" "${i/f16/q4_0}" 2
./quantize "$i" "${i/f16/q4_0}" q4_0
fi
done
else
1 change: 1 addition & 0 deletions .gitignore
@@ -15,6 +15,7 @@ build-em/
build-debug/
build-release/
build-static/
build-cublas/
build-no-accel/
build-sanitize-addr/
build-sanitize-thread/
123 changes: 104 additions & 19 deletions README.md
@@ -7,31 +7,27 @@

Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++

**Warnings**

- `Q4_2` and `Q4_3` are still in development. Do not expect any kind of backward compatibility until they are finalized

**Hot topics:**

- [New quantization methods](https://github.com/ggerganov/llama.cpp#quantization)
- [Added LoRA support](https://github.com/ggerganov/llama.cpp/pull/820)
- [Add GPU support to ggml](https://github.com/ggerganov/llama.cpp/discussions/915)
- [Roadmap Apr 2023](https://github.com/ggerganov/llama.cpp/discussions/784)

## Description

The main goal of llama.cpp is to run the llama model using 4-bit quantization on a MacBook.
The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quantization on a MacBook

- Plain C/C++ implementation without dependencies
- Apple silicon first-class citizen - optimized via ARM NEON and Accelerate framework
- AVX2 support for x86 architectures
- Mixed F16 / F32 precision
- 4-bit quantization support
- 4-bit integer quantization support
- Runs on the CPU

This was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022) - I have no idea if it works correctly.
Please do not make conclusions about the models based on the results from this implementation.
For all I know, it can be completely wrong. This project is for educational purposes.
New features will probably be added mostly through community contributions.
The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
Since then, the project has improved significantly thanks to many contributions. This project is for educational purposes and serves
as the main playground for developing new features for the [ggml](https://github.com/ggerganov/ggml) library.

**Supported platforms:**

@@ -167,15 +163,27 @@ cd llama.cpp

### Build

Note: For Windows, CMake or Zig can be used.
In order to build llama.cpp you have three different options.

1. Use `make`
- Using `make`:
- On Linux or MacOS:

```bash
make
```
```bash
make
```

1. Use CMake
- On Windows:

1. Download the latest Fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
2. Extract `w64devkit` on your PC.
3. Run `w64devkit.exe`.
4. Use the `cd` command to reach the `llama.cpp` folder.
5. From here you can run:
```bash
make
```

- Using `CMake`:

```bash
mkdir build
@@ -184,12 +192,71 @@
cmake --build . --config Release
```

1. Use Zig
- Using `Zig`:

```bash
zig build -Drelease-fast
```

### BLAS Build

Building the program with BLAS support may lead to some performance improvements in prompt processing using batch sizes higher than 32 (the default is 512). BLAS doesn't affect the normal generation performance. There are currently three different implementations of it:
- Accelerate Framework:
This is only available on Mac computers and is enabled by default. You can just build using the normal instructions.

- OpenBLAS:

This provides BLAS acceleration using only the CPU. Make sure to have OpenBLAS installed on your machine.

- Using `make`:
- On Linux:
```bash
make LLAMA_OPENBLAS=1
```
Note: In order to build on Arch Linux with OpenBLAS support enabled, you must edit the Makefile and append `-lcblas` at the end of line 105.

- On Windows:

1. Download the latest Fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).
2. Download the latest version of [OpenBLAS for Windows](https://github.com/xianyi/OpenBLAS/releases).
3. Extract `w64devkit` on your PC.
4. From the OpenBLAS zip that you just downloaded, copy `libopenblas.a` (located inside the `lib` folder) into `w64devkit\x86_64-w64-mingw32\lib`.
5. From the same OpenBLAS zip, copy the contents of the `include` folder into `w64devkit\x86_64-w64-mingw32\include`.
6. Run `w64devkit.exe`.
7. Use the `cd` command to reach the `llama.cpp` folder.
8. From here you can run:

```bash
make LLAMA_OPENBLAS=1
```

- Using `CMake` on Linux:

```bash
mkdir build
cd build
cmake .. -DLLAMA_OPENBLAS=ON
cmake --build . --config Release
```

- cuBLAS

This provides BLAS acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed. You can install it from your Linux distribution's package manager or download it from here: [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads).
- Using `make`:
```bash
make LLAMA_CUBLAS=1
```
- Using `CMake`:
```bash
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
```
### Prepare Data & Run
```bash
@@ -203,8 +270,8 @@ python3 -m pip install -r requirements.txt
# convert the 7B model to ggml FP16 format
python3 convert.py models/7B/
# quantize the model to 4-bits (using method 2 = q4_0)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin 2
# quantize the model to 4-bits (using q4_0 method)
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
# run the inference
./main -m ./models/7B/ggml-model-q4_0.bin -n 128
Expand All @@ -223,6 +290,24 @@ As the models are currently fully loaded into memory, you will need adequate dis
| 30B | 60 GB | 19.5 GB |
| 65B | 120 GB | 38.5 GB |
### Quantization
Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
Model | F16 | Q4_0 | Q4_1 | Q4_2 | Q4_3 | Q5_0 | Q5_1 | Q8_0
-- | -- | -- | -- | -- | -- | -- | -- | --
7B (ppl) | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0617 | 6.0139 | 5.9934 | 5.9571
7B (size) | 13.0G | 4.0G | 4.8G | 4.0G | 4.8G | 4.4G | 4.8G | 7.1G
7B (ms/tok @ 4th) | 128 | 56 | 61 | 84 | 91 | 91 | 95 | 75
7B (ms/tok @ 8th) | 128 | 47 | 55 | 48 | 53 | 53 | 59 | 75
7B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
-- | -- | -- | -- | -- | -- | -- | -- | --
13B (ppl) | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.3234 | 5.2768 | 5.2582 | 5.2458
13B (size) | 25.0G | 7.6G | 9.1G | 7.6G | 9.1G | 8.4G | 9.1G | 14G
13B (ms/tok @ 4th) | 239 | 104 | 113 | 160 | 175 | 176 | 185 | 141
13B (ms/tok @ 8th) | 240 | 85 | 99 | 97 | 114 | 108 | 117 | 147
13B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
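
The bits-per-weight (bpw) figures follow directly from the quantization block layouts. Below is a minimal C++ sketch (illustrative only, not part of the repository) that reproduces the bpw row from the per-block byte counts used by ggml at this point, assuming 32 weights per block (e.g. Q4_0 stores a 4-byte float scale plus 16 bytes of 4-bit quants; Q5_0 stores a 2-byte fp16 scale, 4 bytes of packed 5th bits and 16 bytes of quants):

```cpp
// Illustrative sketch: reproduce the "bpw" row above from the ggml block layouts.
// Assumes 32 weights per block and the struct layouts present at this commit.
#include <cstdio>

int main() {
    struct Fmt { const char * name; int block_bytes; };
    const Fmt fmts[] = {
        {"q4_0", 4 + 16},          // float d                            + 32 x 4-bit quants -> 5.0 bpw
        {"q4_1", 4 + 4 + 16},      // float d + float m                  + 32 x 4-bit quants -> 6.0 bpw
        {"q5_0", 2 + 4 + 16},      // fp16 d  + 4 B of 5th bits          + 32 x 4-bit quants -> 5.5 bpw
        {"q5_1", 2 + 2 + 4 + 16},  // fp16 d + fp16 m + 4 B of 5th bits  + 32 x 4-bit quants -> 6.0 bpw
        {"q8_0", 4 + 32},          // float d                            + 32 x 8-bit quants -> 9.0 bpw
    };
    for (const auto & f : fmts) {
        printf("%s: %.1f bits per weight\n", f.name, 8.0 * f.block_bytes / 32);
    }
    return 0;
}
```

F16 works out to 16.0 bpw (2 bytes per weight) by the same arithmetic.
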
### Interactive mode
If you want a more ChatGPT-like experience, you can run in interactive mode by passing `-i` as a parameter.
32 changes: 26 additions & 6 deletions examples/quantize/quantize.cpp
@@ -2,8 +2,19 @@
#include "llama.h"

#include <cstdio>
#include <map>
#include <string>

static const std::map<std::string, enum llama_ftype> LLAMA_FTYPE_MAP = {
{"q4_0", LLAMA_FTYPE_MOSTLY_Q4_0},
{"q4_1", LLAMA_FTYPE_MOSTLY_Q4_1},
{"q4_2", LLAMA_FTYPE_MOSTLY_Q4_2},
{"q4_3", LLAMA_FTYPE_MOSTLY_Q4_3},
{"q5_0", LLAMA_FTYPE_MOSTLY_Q5_0},
{"q5_1", LLAMA_FTYPE_MOSTLY_Q5_1},
{"q8_0", LLAMA_FTYPE_MOSTLY_Q8_0},
};

// usage:
// ./quantize models/llama/ggml-model.bin models/llama/ggml-model-quant.bin type
//
@@ -12,11 +23,9 @@ int main(int argc, char ** argv) {

if (argc < 4) {
fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type [nthread]\n", argv[0]);
fprintf(stderr, " type = %d - q4_0\n", LLAMA_FTYPE_MOSTLY_Q4_0);
fprintf(stderr, " type = %d - q4_1\n", LLAMA_FTYPE_MOSTLY_Q4_1);
fprintf(stderr, " type = %d - q4_2\n", LLAMA_FTYPE_MOSTLY_Q4_2);
fprintf(stderr, " type = %d - q4_3\n", LLAMA_FTYPE_MOSTLY_Q4_3);
fprintf(stderr, " type = %d - q8_0\n", LLAMA_FTYPE_MOSTLY_Q8_0);
for (auto it = LLAMA_FTYPE_MAP.begin(); it != LLAMA_FTYPE_MAP.end(); it++) {
fprintf(stderr, " type = \"%s\" or %d\n", it->first.c_str(), it->second);
}
return 1;
}

@@ -30,7 +39,18 @@
const std::string fname_inp = argv[1];
const std::string fname_out = argv[2];

const enum llama_ftype ftype = (enum llama_ftype)atoi(argv[3]);
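// the quantization type can be given either by name (e.g. "q4_0") or by its numeric value from the llama_ftype enum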
enum llama_ftype ftype;
if (argv[3][0] == 'q') {
auto it = LLAMA_FTYPE_MAP.find(argv[3]);
if (it == LLAMA_FTYPE_MAP.end()) {
fprintf(stderr, "%s: unknown ftype '%s'\n", __func__, argv[3]);
return 1;
}
ftype = it->second;
} else {
ftype = (enum llama_ftype)atoi(argv[3]);
}

int nthread = argc > 4 ? atoi(argv[4]) : 0;

const int64_t t_main_start_us = ggml_time_us();
85 changes: 85 additions & 0 deletions ggml-cuda.cu
@@ -41,6 +41,23 @@ typedef struct {
} block_q4_3;
static_assert(sizeof(block_q4_3) == 2 * sizeof(ggml_fp16_t) + QK4_3 / 2, "wrong q4_3 block size/padding");

#define QK5_0 32
typedef struct {
__half d; // delta
uint8_t qh[4]; // 5-th bit of quants
uint8_t qs[QK5_0 / 2]; // nibbles / quants
} block_q5_0;
static_assert(sizeof(block_q5_0) == sizeof(ggml_fp16_t) + sizeof(uint32_t) + QK5_0 / 2, "wrong q5_0 block size/padding");
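// 2 + 4 + 16 = 22 bytes per 32 weights, i.e. 5.5 bits per weight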

#define QK5_1 32
typedef struct {
__half d; // delta
__half m; // min
uint32_t qh; // 5-th bit of quants
uint8_t qs[QK5_1 / 2]; // nibbles / quants
} block_q5_1;
static_assert(sizeof(block_q5_1) == 2 * sizeof(ggml_fp16_t) + sizeof(uint32_t) + QK5_1 / 2, "wrong q5_1 block size/padding");
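// 2 + 2 + 4 + 16 = 24 bytes per 32 weights, i.e. 6.0 bits per weight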

#define QK8_0 32
typedef struct {
float d; // delta
@@ -142,6 +159,64 @@ static __global__ void dequantize_block_q4_3(const void * vx, float * y) {
}
}

static __global__ void dequantize_block_q5_0(const void * vx, float * y) {
const block_q5_0 * x = (const block_q5_0 *) vx;

const int i = blockIdx.x;

const float d = x[i].d;

const uint8_t * pp = x[i].qs;

uint32_t qh;
memcpy(&qh, x[i].qh, sizeof(qh));

for (int l = 0; l < QK5_0; l += 2) {
const uint8_t vi = pp[l/2];
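// bit l (resp. l+1) of qh carries the 5th bit of the corresponding quant; shift it into bit position 4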

const int8_t vh0 = ((qh & (1 << (l + 0))) >> (l + 0)) << 4;
const int8_t vh1 = ((qh & (1 << (l + 1))) >> (l + 1)) << 4;

const int8_t vi0 = ((vi & 0xf) | vh0);
const int8_t vi1 = ((vi >> 4) | vh1);

const float v0 = (vi0 - 16)*d;
const float v1 = (vi1 - 16)*d;

y[i*QK5_0 + l + 0] = v0;
y[i*QK5_0 + l + 1] = v1;
}
}

static __global__ void dequantize_block_q5_1(const void * vx, float * y) {
const block_q5_1 * x = (const block_q5_1 *) vx;

const int i = blockIdx.x;

const float d = x[i].d;
const float m = x[i].m;

const uint8_t * pp = x[i].qs;

const uint32_t qh = x[i].qh;

for (int l = 0; l < QK5_1; l += 2) {
const uint8_t vi = pp[l/2];

const int8_t vh0 = ((qh & (1 << (l + 0))) >> (l + 0)) << 4;
const int8_t vh1 = ((qh & (1 << (l + 1))) >> (l + 1)) << 4;

const int8_t vi0 = (vi & 0xf) | vh0;
const int8_t vi1 = (vi >> 4) | vh1;

const float v0 = vi0*d + m;
const float v1 = vi1*d + m;

y[i*QK5_1 + l + 0] = v0;
y[i*QK5_1 + l + 1] = v1;
}
}
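
For cross-checking these kernels on the host, a minimal CPU reference for one q5_0 block could look like the sketch below (illustrative only, not part of this commit; the function name and the caller-supplied `float d` are assumptions, since the real block stores the scale as fp16):

```cpp
// Hypothetical host-side reference for one q5_0 block, mirroring the kernel above.
// Assumes the same layout: 32 weights, 4-bit quants in qs, their 5th bits packed in qh.
#include <cstdint>
#include <cstring>

void dequantize_q5_0_block_ref(float d, const uint8_t qh_bytes[4],
                               const uint8_t qs[16], float y[32]) {
    uint32_t qh;
    memcpy(&qh, qh_bytes, sizeof(qh)); // gather the packed 5th bits

    for (int l = 0; l < 32; l += 2) {
        const uint8_t vi = qs[l/2];

        // bit l (resp. l+1) of qh becomes bit 4 of the quant
        const int vh0 = ((qh >> (l + 0)) & 1) << 4;
        const int vh1 = ((qh >> (l + 1)) & 1) << 4;

        // combine nibble and high bit, recenter to [-16, 15], then scale
        y[l + 0] = (((vi & 0xf) | vh0) - 16) * d;
        y[l + 1] = (((vi >>  4) | vh1) - 16) * d;
    }
}
```

Dequantizing a whole row is then just a loop over k/32 such blocks, which is what the launcher below parallelizes by assigning one quantization block per CUDA thread block.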

static __global__ void dequantize_block_q8_0(const void * vx, float * y) {
const block_q8_0 * x = (const block_q8_0 *) vx;

@@ -178,6 +253,16 @@ void dequantize_row_q4_3_cuda(const void * vx, float * y, int k, cudaStream_t st
dequantize_block_q4_3<<<nb, 1, 0, stream>>>(vx, y);
}

void dequantize_row_q5_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
const int nb = k / QK5_0;
dequantize_block_q5_0<<<nb, 1, 0, stream>>>(vx, y);
}

void dequantize_row_q5_1_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
const int nb = k / QK5_1;
dequantize_block_q5_1<<<nb, 1, 0, stream>>>(vx, y);
}

void dequantize_row_q8_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
const int nb = k / QK8_0;
dequantize_block_q8_0<<<nb, 1, 0, stream>>>(vx, y);
2 changes: 2 additions & 0 deletions ggml-cuda.h
@@ -77,6 +77,8 @@ void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t st
void dequantize_row_q4_1_cuda(const void * vx, float * y, int k, cudaStream_t stream);
void dequantize_row_q4_2_cuda(const void * vx, float * y, int k, cudaStream_t stream);
void dequantize_row_q4_3_cuda(const void * vx, float * y, int k, cudaStream_t stream);
void dequantize_row_q5_0_cuda(const void * vx, float * y, int k, cudaStream_t stream);
void dequantize_row_q5_1_cuda(const void * vx, float * y, int k, cudaStream_t stream);
void dequantize_row_q8_0_cuda(const void * vx, float * y, int k, cudaStream_t stream);

#ifdef __cplusplus
