Vulkan Bugfixes and Improvements #7084

Merged: 9 commits from 0cc4m/vulkan-improvements into master, May 9, 2024
Conversation

@0cc4m (Collaborator) commented May 5, 2024

Apologies for the wait since the last update, I was rather busy.

Here are a number of bugfixes that should hopefully fix the incoherence the Vulkan backend has shown for a while now.

I also modified the MMV shaders to run batches in a single call instead of multiple calls; this might improve performance on devices with higher shader invocation overhead.
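
For illustration only (a hedged sketch, not this PR's actual shader or dispatch code; the function names and parameters are hypothetical): the idea is to replace a loop of per-batch compute dispatches with one dispatch whose third workgroup dimension spans the batch, so the per-call invocation overhead is paid once.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Old pattern: one dispatch per batch element, paying call overhead each time.
void dispatch_mmv_per_batch(VkCommandBuffer cmd, uint32_t groups_x, uint32_t batch_count) {
    for (uint32_t b = 0; b < batch_count; ++b) {
        vkCmdDispatch(cmd, groups_x, 1, 1);
    }
}

// New pattern: a single dispatch; the shader derives its batch index from
// gl_WorkGroupID.z instead of relying on a separate call per batch.
void dispatch_mmv_batched(VkCommandBuffer cmd, uint32_t groups_x, uint32_t batch_count) {
    vkCmdDispatch(cmd, groups_x, 1, batch_count);
}
```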

Finally, further work towards running MoE models with Vulkan is included, but the mul_mat_id code is not ready yet.

@MaggotHATE (Contributor)

Thank you for the update! Unfortunately, I'm seeing gibberish on a 10.7B model at Q5_K_S (a frankenmerge, sure, but it works just fine on CPU and CLBlast) with a long initial prompt (~2700 tokens). Parameters are n_ctx 4096, n_batch 2048, u_batch 512. Setting n_batch to 4096 doesn't help. Another model, 7B at Q6_K, gives correct output on Vulkan with these settings.

Additionally, there's a very noticeable delay after detecting the Vulkan device on Win 10 (new system, still a 1060 3GB), which was hardly noticeable on Win 8.1. That, however, might or might not be caused by having a single graphics output device on my new system (the previous CPU had an iGPU, which wasn't used, though).

Finally, there's still a huge difference in memory consumption. It seems like the difference for VRAM is even larger now: on that 10.7B model, 9 layers with clblast occupy 1792 MB, while 7 layers with Vulkan occupy 2524 MB. Also, it uses ~300 MB of shared VRAM with any number of layers.

With no_kv_offload Vulkan now uses even more shared VRAM, which probably makes sense (previously it just used RAM - or maybe it's just Windows's quirks).

At the same time, the difference in speed between this and clblast is even bigger: Vulkan is really fast both in prompt processing and token generation.

@0cc4m (Collaborator, Author) commented May 5, 2024

> Thank you for the update! Unfortunately, I'm seeing gibberish on a 10.7B model at Q5_K_S (a frankenmerge, sure, but it works just fine on CPU and CLBlast) with a long initial prompt (~2700 tokens). Parameters are n_ctx 4096, n_batch 2048, u_batch 512. Setting n_batch to 4096 doesn't help. Another model, 7B at Q6_K, gives correct output on Vulkan with these settings.

Can you give me a link to the model that's not working? If I can reproduce the source of the incoherence I can hopefully fix it.
Does it work without this PR?

> Additionally, there's a very noticeable delay after detecting the Vulkan device on Win 10 (new system, still a 1060 3GB), which was hardly noticeable on Win 8.1. That, however, might or might not be caused by having a single graphics output device on my new system (the previous CPU had an iGPU, which wasn't used, though).

This is most likely shader compilation happening. The GPU driver should cache the shaders, so it should only be slow once with each update and fast on subsequent launches.
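
As an aside, here is a minimal sketch of how an application can persist a Vulkan pipeline cache across launches (the driver usually keeps its own cache as well); `device`, `cache`, and the file path are assumptions for the example, not llama.cpp code:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>
#include <vector>

// Serialize a pipeline cache to disk so compiled shaders can be reloaded on
// the next launch instead of being recompiled.
void save_pipeline_cache(VkDevice device, VkPipelineCache cache, const char * path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);      // query required size
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());  // fetch the cache blob
    if (FILE * f = std::fopen(path, "wb")) {
        std::fwrite(blob.data(), 1, blob.size(), f);
        std::fclose(f);
    }
}
```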

> Finally, there's still a huge difference in memory consumption. It seems like the difference for VRAM is even larger now: on that 10.7B model, 9 layers with clblast occupy 1792 MB, while 7 layers with Vulkan occupy 2524 MB. Also, it uses ~300 MB of shared VRAM with any number of layers.

This shouldn't have changed compared to without this PR. It's expected that Vulkan uses more VRAM for layers since much more of the model is offloaded. The CLBlast backend basically only runs the matrix multiplication on the GPU and nothing else.

> With no_kv_offload Vulkan now uses even more shared VRAM, which probably makes sense (previously it just used RAM - or maybe it's just Windows's quirks).

Shared VRAM is most likely the staging buffers for copying data to and from the GPU. Disabling KV offload means that the KV cache resides in RAM (shared VRAM is RAM), so that's expected behavior.
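
For context, an illustrative sketch (not the backend's actual allocator) of why staging buffers show up as shared memory: they are plain transfer buffers backed by host-visible memory, i.e. system RAM that both the CPU and GPU can access.

```cpp
#include <vulkan/vulkan.h>

// Memory properties typically requested for a staging buffer: host-visible,
// host-coherent RAM, which Windows counts as "shared GPU memory".
constexpr VkMemoryPropertyFlags k_staging_mem_flags =
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;

// A staging buffer only needs transfer usage; compute work reads and writes
// the device-local copies instead.
VkBufferCreateInfo staging_buffer_info(VkDeviceSize size) {
    VkBufferCreateInfo info = {};
    info.sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
    info.size        = size;
    info.usage       = VK_BUFFER_USAGE_TRANSFER_SRC_BIT | VK_BUFFER_USAGE_TRANSFER_DST_BIT;
    info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
    return info;
}
```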

> At the same time, the difference in speed between this and clblast is even bigger: Vulkan is really fast both in prompt processing and token generation.

Thanks for testing! Did the speed improve for you compared to without this PR?

@daniandtheweb (Contributor)

For me, on a Radeon 5700 XT the performance is almost the same as the main branch, just a little bit slower: 264 t/s (main) vs 260 t/s (PR) for prompt processing, and 33 t/s (main) vs 30 t/s (PR) for generation on Llama 3 Q5_K_M.

@MaggotHATE (Contributor)

> Thanks for testing! Did the speed improve for you compared to without this PR?

Yes, but it's only noticeable at the start. The average is not so impressive for generation: for example, on a 7B Q6_K model it goes from 5.033 t/s to 5.064 t/s. Still, it was a 611-token result, so the usual slowdown diminishes the improvement.

Processing of that ~2700t instruct: on 10.7B at Q5_K_S (10 layers offloaded) it was 35.895 t/s, with this PR it's 37.755 t/s.
On 7B at Q6_K (9 layers offloaded) it was 29.942 t/s, with this PR it's 30.833 t/s.

> so it should only be slow once with each update and fast on subsequent launches.

Ok, I see now: it happened all the time because I was alternating between the clblast and vk versions. Also, maybe it's because it uses all available memory (even though it doesn't look like it on the graphs).

It seems to cache each version's shaders separately, because launching one doesn't speed up launching the other. Also, the mainline compiles faster, but it's not a big deal.

> Can you give me a link to the model that's not working?

https://huggingface.co/mradermacher/Fimbulvetr-10.7B-v1-i1-GGUF - I'm still testing messages reloading in my program, and for some reason this model became a good benchmark for that. I'm not sure of the quality or the changes imatrix brought.

> Does it work without this PR?

No, the same gibberish happened. Interestingly, while trying to test it, I struggled to even run the model with that large instruct. I had to increase the number of layers from 9 to 10 to make it work. It's like a sweet spot - not higher, not lower, exactly 10.

@0cc4m (Collaborator, Author) commented May 5, 2024

> Can you give me a link to the model that's not working?
>
> https://huggingface.co/mradermacher/Fimbulvetr-10.7B-v1-i1-GGUF - I'm still testing messages reloading in my program, and for some reason this model became a good benchmark for that. I'm not sure of the quality or the changes imatrix brought.
>
> Does it work without this PR?
>
> No, the same gibberish happened. Interestingly, while trying to test it, I struggled to even run the model with that large instruct. I had to increase the number of layers from 9 to 10 to make it work. It's like a sweet spot - not higher, not lower, exactly 10.

I downloaded the q5_k_s version of that model you linked and it runs fine for me across AMD and Nvidia GPUs. Not sure what's going on on your end. Which GPU are you using?

@MaggotHATE (Contributor)

> Which GPU are you using?

Same 1060 3GB, and the issue happens on a large initial instruct only. It works just fine with a typical Alpaca instruct or similar.

@MaggotHATE (Contributor) commented May 5, 2024

Update: gibberish just happened on a ~1100t prompt. I wanted to try setting n_ubatch to 2048, but it's too much memory for my setup (16GB RAM). Same on mainline and this PR.
UPD: in case it helps, the ~2700t instruct is this, modified for the model format and with a text included in it.

@teleprint-me (Contributor) commented May 7, 2024

I have a question, probably not related, but @0cc4m is the only one who can really answer it. When I use train-text-from-scratch, I see a noticeable improvement compared to CPU, but I do not see any GPU usage when I train. Is there any reason why this is the case?

Also, as an aside, the initial implementations allowed me to offload most of the layers to the GPU without any hiccups, but now I'll crash if I allocate too many layers to the GPU. This has led me to switch between the CPU and GPU for different tasks. Does this have anything to do with your previous PR where you modified how the layers were handled?

I have narrowed down a general bug to the GLFW backend that is unrelated to llama.cpp, so I'm not sure if it's related or not. Still haven't pinned it down yet. This shouldn't be an issue in the near future because I plan on replacing my RX 580 with either an RTX 4060 Ti or a 7900 XT, haven't decided yet.

Regardless, just curious if you're able/willing to provide any insights?

@netrunnereve (Contributor)

I did some quick tests with my W8100 and didn't really see any improvements or regressions. Honestly, after getting my CPU server I've been using Vulkan less and less, since my GPU is really only good for 7B models, and Command R 30B and Llama 70B completely blow away the small ones.

PR:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp 512 | 95.55 ± 0.30 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg 128 | 11.26 ± 0.03 |
| llama 8B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | pp 512 | 71.30 ± 0.29 |
| llama 8B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | tg 128 | 6.03 ± 0.08 |

Master:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | pp 512 | 93.49 ± 0.43 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | Vulkan | 99 | tg 128 | 11.75 ± 0.06 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | pp 512 | 69.26 ± 0.33 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 99 | tg 128 | 6.09 ± 0.05 |

@0cc4m (Collaborator, Author) commented May 8, 2024

> Update: gibberish just happened on a ~1100t prompt. I wanted to try setting n_ubatch to 2048, but it's too much memory for my setup (16GB RAM). Same on mainline and this PR. UPD: in case it helps, the ~2700t instruct is this, modified for the model format and with a text included in it.

I can't seem to reproduce that issue. But n_batch > 512 is definitely broken; I'll take a look at what's going on there.

@0cc4m (Collaborator, Author) commented May 8, 2024

> I have a question, probably not related, but @0cc4m is the only one who can really answer it. When I use train-text-from-scratch, I see a noticeable improvement compared to CPU, but I do not see any GPU usage when I train. Is there any reason why this is the case?

To be honest, I have no idea what train-text-from-scratch does and I'd be surprised if it can use Vulkan (assuming that's what you meant).

> Also, as an aside, the initial implementations allowed me to offload most of the layers to the GPU without any hiccups, but now I'll crash if I allocate too many layers to the GPU. This has led me to switch between the CPU and GPU for different tasks. Does this have anything to do with your previous PR where you modified how the layers were handled?

Do you mean for running a model with main? VRAM use might have changed in later versions. Can you give me more details on what worked, what didn't/doesn't and on what hardware?

> I have narrowed down a general bug to the GLFW backend that is unrelated to llama.cpp, so I'm not sure if it's related or not. Still haven't pinned it down yet. This shouldn't be an issue in the near future because I plan on replacing my RX 580 with either an RTX 4060 Ti or a 7900 XT, haven't decided yet.
>
> Regardless, just curious if you're able/willing to provide any insights?

What GLFW backend?

@teleprint-me (Contributor) commented May 8, 2024

@0cc4m

> To be honest, I have no idea what train-text-from-scratch does and I'd be surprised if it can use Vulkan (assuming that's what you meant).

That is what I meant. There is a definite 3x speed-up: a 3-hour training session is dramatically reduced to a ~40-50 minute training session, every time. I didn't think it would work, but tried it out to see.

I use `make` for the CPU build and `make LLAMA_VULKAN=1` for the GPU build.

Any insights into why this might be the case @ggerganov? I don't know/understand enough about the implementation details.

> Do you mean for running a model with main? VRAM use might have changed in later versions. Can you give me more details on what worked, what didn't/doesn't and on what hardware?

I've been avoiding using GPU as much as I can lately because it keeps crashing my entire system.

I would use -ngl n, where n is the number of layers in the model. I would use Mistral 7B v0.2 and offload 32 layers, and it would work fine up until a certain point where it would obviously slow down. The GPU would be maxed out by that point. Usually 16 layers were okay, even for my little 8 GB of VRAM.

I have plenty of CPU RAM, but it's not ideal for back prop.

> What GLFW backend?

I feel this is out of scope, but it does affect the AMDGPU DRM for the RX 5xx series, and the Vulkan backend has crashed in a similar fashion while using llama.cpp, which is part of the reason it's been difficult to trace and isolate.

I'll have to do some thorough tests when I'm not so deep into my work. Too many projects open at once ATM to risk it. GLFW#2493. There are a few other identified bugs related to the RX 5xx series with the Mesa graphics drivers.

They wanted me to test an AUR package, but I haven't had time to test yet.

@teleprint-me (Contributor)

I tried reproducing the results since I had to do an upgrade, but I can't with the latest commit. I'm so scattered at the moment, I think I might have lost track. The Vulkan backend does not seem to affect training text from scratch. I must've been mistaken. Sorry for wasting any time. I'll post if I figure it out.

@teleprint-me (Contributor)

I just wanted to add some relevant input for this branch. I've been experimenting with it and it cleared up a few issues.

  • No random crashing. Some flickering. This is most likely due to poor driver support at this point. The code seems more stable overall though.
  • No more "gibberish" is being produced by my models. My models are outputting quality content on the GPU.
  • I've been running inference and have noticed improved runtime.
  • I'm able to offload layers as needed.

I'm sure this is a mixture of things, but these differences are extremely noticeable when compared to the master branch. My 30M param model just zips with it as well.

I think you're right about train from scratch not supporting this. I'll have to dig into this some more. Would be nice if it did work.

@0cc4m (Collaborator, Author) commented May 9, 2024

@ggerganov @slaren If I run inference on Mistral 7B q4_k_s with batch size 2048 and context size 2048 (and the default ubatch size, which should be 512), I get GET_ROWS calls with src[1]->ne[0] == 0, i.e. an empty src1. That causes validation issues in Vulkan, because it leads to calls with dispatch ranges of 0. I can try to catch those, but it propagates through the graph.

The call I used to run into this is `build/bin/main -f input_long.txt -c 2048 -n 512 --ignore-eos -m models/airoboros-m-7b-3.1.2.Q4_K_S.gguf -ngl 1000`, where input_long.txt is a file with a 1737-token prompt.

Is that correct behavior that I need to work around, or is that a bug in the model code?

@slaren (Collaborator) commented May 9, 2024

That's normal behavior since #6122; the backend should skip zero-size tensors. The Vulkan backend was also updated in that PR, but maybe there are other cases.

@0cc4m (Collaborator, Author) commented May 9, 2024

> That's normal behavior since #6122; the backend should skip zero-size tensors. The Vulkan backend was also updated in that PR, but maybe there are other cases.

Alright, thank you. I'll make sure they are skipped properly. After that fix this PR should be ready to merge.
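
A minimal sketch of what such a skip might look like (the helper name is hypothetical; this is not the actual commit):

```cpp
#include "ggml.h"

// A node is "empty" if any dimension is zero; recording a dispatch for it
// would produce a zero-sized dispatch range, so the backend skips it instead.
static bool ggml_vk_node_is_empty(const struct ggml_tensor * node) {
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        if (node->ne[i] == 0) {
            return true;
        }
    }
    return false;
}

// In the graph-compute loop (with a NULL check on src[1] where it is optional):
//     if (ggml_vk_node_is_empty(node)) {
//         continue;
//     }
```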

@mofosyne added the labels bugfix and Review Complexity: High on May 9, 2024.
@0cc4m (Collaborator, Author) commented May 9, 2024

I found and fixed the issue. I'll wait for the CI to finish and then merge this PR.

@MaggotHATE (Contributor)

Gibberish is fixed! Thank you for the update @0cc4m !

@0cc4m merged commit befddd0 into master on May 9, 2024 (60 checks passed).
@0cc4m deleted the 0cc4m/vulkan-improvements branch on May 9, 2024 at 18:39.