
b2447 (c47cf41) decreased output quality #6571

Closed
Azirine opened this issue Apr 9, 2024 · 17 comments
Labels: bug-unconfirmed, need more info (the OP should provide more details about the issue)

Comments

Azirine commented Apr 9, 2024

With identical seeds and options, b2447 (c47cf41) produces different output that seems lower in quality compared to b2446. Is it possible to preserve old output quality in new builds?

System: MacBook Pro w/ i5-1038NG7

@ggerganov (Owner)

Can you provide more details - how did you determine that the quality is worse? Is the perplexity higher?

ggerganov added the "need more info" label on Apr 10, 2024
Jeximo (Contributor) commented Apr 11, 2024

Why does this commit alter output?

@Azirine In order to figure out the difference, please show the steps for how you decided c47cf41 lowered output quality.

David-AU-github commented Apr 12, 2024

I have noticed the same issue -> conducted tests as follows (via LMStudio):
Tests 1-6: long-form output (one prompt, no regen, one shot) -> GPU (CUDA/NVIDIA)
Tests 1-6: long-form output (same as above) -> CPU only.

CPU only beat GPU output hands down.

Additionally, there is a cascading "RoPE" issue causing problems in LMStudio, which I have contacted the devs about.
It is unclear whether this is related to this llama.cpp issue or not.
(I have a workaround, so the results above are not contaminated by this issue.)

Additional note:
GPU 75% / CPU 25% -> Always seems to yield higher quality output.
GPU 50% / CPU 50% -> Even better quality.

This issue was driving me crazy, as I could detect it (after testing 500+ GGUF models), but it took a while to clue in on it.
This seems to be a model-by-model issue.
Also, this contrast is far stronger at lower Qs/IQs -> ran tests to confirm.
I don't know if the "math" is more precise on the CPU offload and/or there is an issue with CUDA/NVIDIA math.

This is "human judgement" - however easy to spot "side by side". I would say the models are following the more nuanced instructions in the prompt (standardized across 500+ model tests), and the output also shows greater detail and nuance as well when partial CPU or full CPU vs GPU only.

Jeximo (Contributor) commented Apr 12, 2024

CPU only beat GPU output hands down. ... GPU 75% / CPU 25% -> Always seems to yield higher quality output.
GPU 50% / CPU 50% -> Even better quality. ... and the output also shows greater detail and nuance as well when partial CPU or full CPU vs GPU only.

@david565656 Instead of subjective (confusing) judgement, let's see an example focused on llama.cpp, excluding LMStudio, answering "this is why the quality is lower".

@David-AU-github

Here is the prompt and method to reproduce the results.

For clarity: GPU only and CPU only.
(I can also create a PDF with the results, as each test output is 1000-2000 tokens.)

Across all testing of 500+ models (which also includes comparisons between Qs and IQs of the same model, and comparisons against the same model's GPTQ, AWQ, and EXL2 versions), the testing, parameters, and prompt are exactly the same. This has been maintained over 6+ months of testing.

TEST 1 - IQ tests (low IQ used to contrast the differences more sharply):
MythoLogic-L2-13b.i1-IQ1_M.gguf 3.35
GPU: Great, but 3rd person (?) 2nd test -> First person, quality same as 1st gpu.
CPU/GPU: Excellent, maybe short of "cpu only".
CPU: Excellent, maybe off the charts.

TimeLess-20B.i1-IQ1_M.gguf 4.98
GPU: great but short. (2) 32 t/s
CPU-GPU: excellent. (1 or +1) 6.5 T/s [context still on GPU?]
CPU: excellent ++ (at or above 1), but short 4 t/s - 2nd regen better. [context still on GPU?]
CPU offload all: equal to "cpu" , 1st short ... 2nd 3rd person [leng better] - 4 t/s [nothing, including context on GPU]

TimeLess-20B.i1-IQ1_S.gguf 4.61
GPU: great [2] 32 t/s
CPU-GPU: excellent. [1] 9 t/s
CPU: excellent ++ at / above 1 5.4 t/s
CPU offload all: excellent +++ 5 t/s - short sentences, breaks in conversation, description, everything. [nothing, including context, on GPU]

TEST Group 2 - Regular Qs:

DavidAU/DarkSapling-7B-v1.0-Q6_K-GGUF/darksapling-7b-v1.0.Q6_K.gguf
DavidAU/DarkSapling-7B-v1.1-Q6_K-GGUF/darksapling-7b-v1.1.Q6_K.gguf
DavidAU/DarkSapling-7B-v2.0-Q6_K-GGUF/darksapling-7b-v2.0.Q6_K.gguf
Lewdiculous/KukulStanta-7B-GGUF-IQ-Imatrix/KukulStanta-7B-Q8_0-imat.gguf
TheBloke/Seraph-7B-GGUF/seraph-7b.Q8_0.gguf
bartowski/Tess-7B-v2.0-GGUF/Tess-7B-v2.0-Q8_0.gguf

This test group - when run with GPU only and then CPU only - highlights stark differences in output quality.
Especially of note is the first model in this series, which is "twitchy" on GPU generation yet perfectly fine on CPU-only generation.
On GPU it goes over context, goes into "repeat" mode at the end of context, and in extreme cases will crash llama.cpp/LMS.

On CPU - no issues. Stops when it should, context is coherent and detailed. Likewise for the other 5 models run on CPU only. In fact, just visually speaking, the CPU output of all 6 models is almost the same at the visual level (not reading at all), whereas GPU output is all over the place (paragraph issues, prose, spacing and the like).

Note I am running Windows 11, with an NVIDIA 4060 Ti 16 GB (Nov 2023).

Subjective differences:
Sentence structure (and variety), word choice, description detail, general overall "there", prose errors, use of specific low quality words, output length (vs instructions), general creativity, and following of the instructions in the case of suspense, tension and other elements in the instructions.

Here is the master test prompt:

Using the following "story idea" below, write the first scene in the novel introducing the young woman. This scene should start in the middle of the action, include dialog, vivid passages, and end on a cliffhanger relevant to the story idea but it should also be unexpected. The scene should be 1000 words long and escalate in conflict and suspense and be written in first person, present tense with the point of view character being the young woman.

Story idea:

In a world ruled by dictatorship, a rebel young woman leads a rebellion against the system. Despite the risks, she fights to overthrow the dictator and restore democracy to her country. The government executes her for treason, but she sticks to her beliefs and is responsible for starting the revolution.

Here is the system role:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

Parameters:
Temp: 0.8; top-k: 40; repeat penalty: 1.1; min-p: 0.05; top-p: 0.95 (these are LMS defaults, used for all testing over the last 6 months)

Let me know if a PDF would help.
Thanks,
DAVE

Jeximo (Contributor) commented Apr 13, 2024

LMSYS Chatbot Arena

@Azirine See I didn't say "LMSYS". Please do not read things I didn't say, that'd be great.

alters the model's outputs even with identical prompts, seeds and options.

I accept c47cf41 may have changed output. Now, will you share how you're accessing llama.cpp, including prompts, seeds, options and an example?

Azirine (Author) commented Apr 13, 2024

./main -ins -s 0 --in-prefix "[INST] " --in-suffix "[/INST] " -c 0 --repeat-penalty 1 --temp 0 -m mistral-7b-instruct-v0.2.Q5_K_M.gguf

b2444

[INST] Who are you?
[/INST] I am a large language model trained by Mistral AI. I don't have the ability to have a personal identity or emotions, but I can help answer questions, generate text, and provide information on a wide range of topics. How can I assist you today?

[INST]

b2447

[INST] Who are you?
[/INST] I am a large language model developed by Mistral AI. I am designed to generate human-like text based on the input I receive. I don't have the ability to have a personality or emotions, but I can process and generate text in various styles and formats. I'm here to help answer questions, generate creative content, and engage in conversational exchanges. How can I assist you today?

[INST]

@ggerganov (Owner)

@Azirine The CPU that you referenced supports AVX-512:

https://www.intel.com/content/www/us/en/products/sku/196594/intel-core-i51038ng7-processor-6m-cache-up-to-3-80-ghz/specifications.html

So differences before and after c47cf41 are expected, due to the different instruction sets used before (AVX) and after (AVX-512).

The AVX-512 change has not been thoroughly tested, so there might be issues. One thing that could be problematic is that we PAD the KV cache to 32 elements, while AVX-512 requires 64 elements:

kv_self.n = std::min(kv_self.size, std::max(32u, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));

Though that single example is not enough to make a definite conclusion.

Could you apply the following patch and see if it makes a difference at all in the generated AVX-512 results:

diff --git a/llama.cpp b/llama.cpp
index cf95cea1..e391f30b 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -10473,7 +10473,7 @@ static int llama_decode_internal(
                 // a heuristic, to avoid attending the full cache if it is not yet utilized
                 // after enough generations, the benefit from this heuristic disappears
                 // if we start defragmenting the cache, the benefit from this will be more important
-                kv_self.n = std::min(kv_self.size, std::max(32u, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32)));
+                kv_self.n = std::min(kv_self.size, std::max(64u, GGML_PAD(llama_kv_cache_cell_max(kv_self), 64)));
                 //kv_self.n = llama_kv_cache_cell_max(kv_self);
             }
         }
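For context on why the pad size matters: GGML_PAD-style rounding takes a size up to the next multiple of its second argument, so padding the live cell count to 32 versus 64 can yield a different effective KV window. The sketch below is a hypothetical re-creation of that rounding (the names `pad_to` and `kv_window` are illustrative, not actual ggml/llama.cpp symbols), mirroring the `std::max(pad, GGML_PAD(cell_max, pad))` shape of the line being patched.

```c
/* Round x up to the nearest multiple of n, where n is a power of two.
 * (Illustrative re-creation; not the actual ggml macro.) */
#define pad_to(x, n) (((x) + (n) - 1) & ~((n) - 1))

/* Effective KV window for a given pad size: at least one full pad
 * block, otherwise the padded live cell count. */
unsigned kv_window(unsigned cell_max, unsigned pad) {
    unsigned padded = pad_to(cell_max, pad);
    return padded > pad ? padded : pad;
}
```

With 70 live cells, `kv_window(70, 32)` gives 96 while `kv_window(70, 64)` gives 128 - i.e. the two pad sizes attend over a different number of cache cells, which is the kind of divergence the patch probes for.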

Azirine (Author) commented Apr 15, 2024

Options: ./main -ins -s 0 --in-prefix "[INST] " --in-suffix "[/INST] " -c 0 --repeat-penalty 1 --temp 0 -m mistral-7b-instruct-v0.2.Q5_K_M.gguf

Prompt: Who are you?

b2444

I am a large language model trained by Mistral AI. I don't have the ability to have a personal identity or emotions, but I can help answer questions, generate text, and provide information on a wide range of topics. How can I assist you today?

b2447 with patch

I am a large language model developed by Mistral AI. I am designed to generate human-like text based on the input I receive. I don't have the ability to have a personality or emotions, but I can simulate conversation and respond to a wide range of topics. How can I help you today?

b2447

I am a large language model developed by Mistral AI. I am designed to generate human-like text based on the input I receive. I don't have the ability to have a personality or emotions, but I can process and generate text in various styles and formats. I'm here to help answer questions, generate creative content, and engage in conversational exchanges. How can I assist you today?

Azirine (Author) commented Apr 15, 2024

A longer example

Options: ./main -ins -s 0 --in-prefix "[INST] " --in-suffix "[/INST] " -c 0 --repeat-penalty 1 --temp 0 -m mistral-7b-instruct-v0.2.Q5_K_M.gguf

Prompt: In a world ruled by dictatorship, a rebel young woman leads a rebellion against the system. Despite the risks, she fights to overthrow the dictator and restore democracy to her country. The government executes her for treason, but she sticks to her beliefs and is responsible for starting the revolution.

b2444

Title: "The Unyielding Flame: A Tale of Courage and Revolution"

Once upon a time, in a land shrouded by the iron grip of a tyrannical dictator, there lived a young woman named Aria. Her heart was filled with a burning desire for freedom and democracy, a dream that seemed like a distant memory in her oppressed homeland.

Aria was not one to shy away from challenges. She was a beacon of hope, a symbol of resistance against the oppressive regime. Despite the risks, she organized a rebellion, rallying the disenchanted masses to join her cause. Her words were like a spark that ignited the flames of revolution.

The government, fearful of the growing unrest, hunted Aria relentlessly. But she was a clever and resourceful leader, always managing to evade capture. Her followers were inspired by her bravery and determination, and they continued to fight, fueled by her unwavering belief in the power of the people.

However, one day, Aria was finally captured. The dictator, enraged by her defiance, ordered her execution for treason. But even as the noose tightened around her neck, Aria's spirit remained unbroken. She looked her captors in the eye and declared, "Your tyranny will not last forever. The people will rise again, and they will reclaim their freedom!"

Her words echoed through the crowd, inspiring her followers to redouble their efforts. The revolution continued to grow, fueled by the memory of Aria's courage and sacrifice. And though she was gone, her spirit lived on, guiding the rebellion to victory.

In the end, the dictator was overthrown, and democracy was restored to the land. Aria's name became a legend, a symbol of hope and courage in the face of tyranny. And though she had given her life for the cause, her legacy lived on, inspiring generations to come.

The story of Aria, the brave rebel, serves as a reminder that even in the darkest of times, there is always hope. That one person, with the courage to stand up and fight, can change the course of history. And that the flame of freedom, once lit, can never be extinguished.

b2447 with patch (identical except last 3 paragraphs)

...

Her words echoed through the crowd, inspiring her followers to redouble their efforts. And so, the revolution continued, fueled by the memory of Aria and her unyielding spirit. Though she was gone, her legacy lived on, inspiring a new generation of rebels to carry on the fight for democracy and freedom.

In the end, the dictator's regime was overthrown, and a new era of peace and prosperity dawned upon the land. Aria's sacrifice had not been in vain. She had become a symbol of hope and courage, a reminder that even in the darkest of times, the human spirit can never be truly extinguished.

And so, the story of Aria, the brave rebel, became a legend, passed down from generation to generation, a testament to the power of the human spirit and the indomitable will to fight for what is right.

b2447

Title: "The Unyielding Flame: A Rebel's Tale"

Prologue:

In the heart of a land shrouded in darkness, where the sun of freedom was eclipsed by the iron fist of tyranny, a young woman named Aria stood defiant. The air was thick with the stench of fear and oppression, yet her spirit remained unbroken. She was a beacon of hope in a world consumed by despair.

Chapter 1: The Spark

Aria, a humble weaver's daughter, had always been a dreamer. She longed for a world where the people were free to speak their minds, to live without fear of persecution, and to determine their own destiny. As she wove intricate patterns into the fabric of her family's livelihood, she wove dreams of a better future into the hearts and minds of her fellow citizens.

Her whispers of change, however, did not go unnoticed. The dictator, a cruel and merciless ruler, saw her as a threat to his iron grip on power. He ordered her arrest, but Aria was not one to be easily silenced.

Chapter 2: The Rebellion

In the dank and dismal cells of the prison, Aria's spirit only grew stronger. She rallied her fellow inmates, igniting a flame of rebellion that would soon spread like wildfire throughout the land. With her words of hope and her unwavering determination, she inspired a movement that would shake the very foundations of the dictator's regime.

Chapter 3: The Uprising

The people, once cowed and subdued, rose up in defiance. They marched through the streets, their voices raised in a cacophony of rebellion. The dictator's soldiers, once thought invincible, were no match for the unyielding spirit of the people.

Chapter 4: The Sacrifice

Aria, the face of the rebellion, was captured and brought before the dictator. Despite the risks, she refused to back down. She stood before the tyrant, her eyes blazing with the fire of freedom. The dictator, enraged by her defiance, ordered her execution.

Epilogue:

Aria's death was a martyrdom, a symbol of hope and freedom in a world long bereft of both. Her sacrifice sparked a revolution that would eventually topple the dictator and restore democracy to the land. The people, inspired by her courage, continued the fight, and the flame of rebellion burned bright, illuminating the path to a brighter future.

The End.

@David-AU-github

My two cents here:
"With the patch" -> word choice is more nuanced and precise. Sentence structure is also somewhat better. It is definitely higher quality. That being said, "general use" models tend to "over explain", so this is not an error or issue.

If you want to test/try out a model specific for creative prose / contrast with/without the patch try this one:
Psyonic-cetacean-20B.i1-Q5_K_M.gguf

This model and test prompt ("Rebel") may show the contrasts more sharply - especially nuance.

Azirine (Author) commented Apr 17, 2024

Output improved with the patch, but there are still differences compared to pre-b2447. What else could be causing it?

@ggerganov (Owner)

@Azirine The differences are due to using different instruction sets to perform the matrix multiplication - the floating-point numbers are accumulated in a different order, leading to small numerical differences.
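To make the accumulation-order point concrete: float addition is not associative, so summing the same values in a different order can round differently. A small standalone sketch (illustrative only, not llama.cpp code; the function names are invented for this example):

```c
/* Sum 64 ones onto a large float one at a time: near 1e8 the spacing
 * between adjacent floats is 8.0, so each lone 1.0f is rounded away. */
float sum_sequential(void) {
    float s = 1e8f;
    for (int i = 0; i < 64; ++i) {
        s += 1.0f;
    }
    return s;
}

/* Group the small terms first, then add once: the subtotal 64.0f is
 * exactly representable at this magnitude and survives the addition. */
float sum_grouped(void) {
    float small = 0.0f;
    for (int i = 0; i < 64; ++i) {
        small += 1.0f;
    }
    return 1e8f + small;
}
```

Both functions add exactly the same 65 numbers, yet `sum_sequential()` returns 1e8f while `sum_grouped()` returns 1e8f + 64.0f. Wider SIMD (AVX-512 vs AVX) changes the grouping of partial sums in the same way, hence small but visible output differences at temperature 0.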

@ggerganov (Owner)

Regarding the patch, on further thought, the computation is correct even without it since we handle the "leftover" elements in the last non-64 block:

llama.cpp/ggml.c

Lines 1537 to 1541 in 8cc91dc

// leftovers
for (int i = np; i < n; ++i) {
sumf += x[i]*y[i];
}

So far, I think we are just observing numerical variations, and technically all computations are correct. We need more objective criteria to decide if there really is an issue - the examples so far are very subjective, IMO.
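The blocked-plus-leftovers pattern referenced above can be sketched as follows (an illustrative stand-in for the SIMD path, not actual ggml code): the body processes whole fixed-size blocks, and the scalar tail loop picks up the remainder, so correctness does not depend on n being a multiple of the block size.

```c
/* Blocked dot product: the inner loop over fixed-size blocks stands in
 * for the vectorized SIMD body, and the "leftovers" loop handles any
 * tail elements beyond the last full block. */
float dot_blocked(const float *x, const float *y, int n, int block) {
    float sumf = 0.0f;
    const int np = (n / block) * block; /* largest multiple of block <= n */
    for (int i = 0; i < np; i += block) {
        for (int j = 0; j < block; ++j) {
            sumf += x[i + j] * y[i + j];
        }
    }
    // leftovers
    for (int i = np; i < n; ++i) {
        sumf += x[i] * y[i];
    }
    return sumf;
}
```

For small integer-valued inputs the result is exact regardless of block size: a length-5 dot product gives the same answer with block = 4 (one leftover element) as with block = 1, which is the sense in which the computation is correct even without the padding patch.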

@David-AU-github

> Regarding the patch, on further thought, the computation is correct even without it since we handle the "leftover" elements in the last non-64 block [...] the examples so far are very subjective IMO

Question:
Is "ggml.c" the same for GPU and CPU?

In terms of the differences between GPU / CPU and mixed GPU/CPU and observed results:

On its face (the prompt/test) it is subjective; however, over several runs on different LLMs, the contrast between GPU, CPU, and mixed CPU/GPU becomes clear. This is also dependent on the LLM being tested as it relates to the prompt itself. I.e., testing involves LLMs of the same use-case type - creative output. Testing this prompt with a non-creative vs. creative LLM would contaminate the results.

I guess the question is: if the for loop is executed on the CPU, are the math computations of the same accuracy?
The issue could also be here:
"int i = np"
Is part of "np" being "chopped", and is this actually the issue or a partial cause?

I am not experienced enough in C to add any further comment... but I have run into issues like this in other languages - especially when for loops do not behave as expected when it comes to math, and/or the default precision in the language is not adequate for the case. In extreme cases I had to set the decimal places manually within the code to ensure other operations were not skewed later in the code.

That being said - something else entirely could be going on:
One or more decimal places are being stripped en route to the GPU and/or at the GPU during token-generation operations.
(input or output?)

This would explain the results I am seeing on GPU only, CPU only and mixed GPU/CPU.
If I had to put a "quality" number on the difference between only GPU and only CPU I would say CPU gen is 5-10% higher quality.

Here is an extreme example of GPU/CPU "errors" (may or may not relate):

Running Goliath 120B at IQ1_S - ~16 GB on GPU, rest on CPU -> in testing 500+ models, I saw something I have never seen before -- the model's output STUTTERED (more than once).

Sorry for the wordy reply...

@ggerganov (Owner)

CPU and GPU results are not exactly the same, but at such low-bit quantization I don't think we can make any meaningful conclusions.

@David-AU-github

Thank you for all you do.
And thank you for the reply too.

FYI: I finally got llama.cpp installed on my Windows machine. I put the details and fixes in another ticket for those having the same issues installing it. Windows/MS make things way more difficult than they have to be.

Spent the day testing various quants / imatrix files and perplexities and studying the differences using different imatrix.dat files.
Moving forward.

The github-actions bot added and then removed the stale label on May 20, 2024
Azirine closed this as completed on May 24, 2024
4 participants