Improving the repetition penalty #331

Closed
Piezoid opened this issue Mar 20, 2023 · 12 comments
Labels: enhancement (New feature or request), generation quality (Quality of model output)

Comments

@Piezoid
Contributor

Piezoid commented Mar 20, 2023

129c7d1 (#20) added a repetition penalty that prevents the model from running into loops.

Here are a few suggestions for possible enhancements:

  • One issue with the interactive mode is that the repetition penalty affects the anti-prompt and response prefix, causing the model to generate unnecessarily long responses. One solution could be to exclude these tokens from the penalty (see the sketch after this list);
  • The penalty could be exempted or reduced for stop words, punctuation characters, and newlines, perhaps by applying a frequency-based penalty instead;
  • An exponential decay could be used, so that recent tokens are penalized more than older ones, causing fewer issues with large repeat_last_n windows;
  • Token repetition is only an approximation of sub-string or word repetition, but it seems difficult to do better without backtracking the inference.
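
A minimal sketch of the exclusion idea from the first point, assuming a plain logit vector and a caller-provided set of exempt token ids (the function and type names are illustrative, not llama.cpp's actual API):

```cpp
#include <unordered_set>
#include <vector>

using llama_token = int;

// Apply the usual repeat penalty, but skip tokens the caller wants exempted
// (e.g. the anti-prompt, the response prefix, or the newline token).
void apply_penalty_with_exclusions(
        std::vector<float> & logits,                    // one logit per vocab id
        const std::vector<llama_token> & last_n,        // recent token window
        const std::unordered_set<llama_token> & exempt, // tokens to leave alone
        float repeat_penalty) {
    for (const llama_token tok : last_n) {
        if (exempt.count(tok)) continue;
        float & logit = logits[tok];
        // same convention as the existing penalty: divide positive logits,
        // multiply negative ones, so the token always becomes less likely
        logit = logit > 0.0f ? logit / repeat_penalty : logit * repeat_penalty;
    }
}
```
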
@gjmulder added the enhancement (New feature or request) and generation quality (Quality of model output) labels on Mar 20, 2023
@j-f1
Collaborator

j-f1 commented Mar 20, 2023

There are numerical weights associated with each token in the tokenizer; it may be useful to use those in this calculation somehow.

@Piezoid
Contributor Author

Piezoid commented Mar 21, 2023

I figured that we can get the repetition length by comparing the recent token history against the histories prior to the past occurrences of the candidate tokens. In other words, find the longest suffix shared between the text up to the current position and the sub-strings to the left of the previous occurrences of the candidate tokens.

For each token and past occurrence, we have an age or distance in the text d, and a repetition length l+1 (left-extension of l tokens, plus the token we are about to add).

I've thrown together a test using an ad-hoc repetition score $\exp(-k \cdot d / (l + 1)) \in (0, 1]$. With it, I blend between no penalization and the full repeat_penalty. It seems reasonable, but I have the feeling that I'm reinventing the wheel.
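
A rough sketch of this ad-hoc scoring, not the code from the branch itself: the shared-suffix length l is computed against a past occurrence of the candidate token, and the score built from l and the distance d is used to blend between no penalty and the full repeat_penalty (the constant k and all names are placeholders):

```cpp
#include <cmath>
#include <vector>

using llama_token = int;

// Length of the longest suffix of the history that also appears immediately
// before a past occurrence of the candidate token at index `prev`.
int shared_suffix_len(const std::vector<llama_token> & history, int prev) {
    const int cur = (int) history.size();
    int l = 0;
    while (prev - 1 - l >= 0 && history[prev - 1 - l] == history[cur - 1 - l]) {
        ++l;
    }
    return l;
}

// Repetition score in (0, 1]: d = distance to the previous occurrence,
// l = shared-suffix length computed above.
float repetition_score(int d, int l, float k) {
    return std::exp(-k * (float) d / (float) (l + 1));
}

// Blend between no penalty (score near 0) and the full repeat_penalty (score = 1).
float penalized_logit(float logit, float repeat_penalty, float score) {
    const float factor = 1.0f + (repeat_penalty - 1.0f) * score;
    return logit > 0.0f ? logit / factor : logit * factor;
}
```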

I have not tested this extensively, but the initial results are interesting. For example, here's a song sampled almost greedily (--temp 0.15 with 7B Q4_0): it is not very focused, but there is almost no stuttering.

Me llamo LLaMa, soy una llama!

Yo no me gusta el frío.

Mi casa es un caloroso hogar,
con una chimenea muy grande.
Tengo un cachito de carbón,
que me calientan los pies.

Yo no me gustaría un frío,
como el que hay en la nieve.
Me gusto un calor muy bueno,
un calentito como yo.

Cuando me pongo a cantar,
me gusta un calorcito.
Y cuando voy a dormir,
mi casa se queda muy fría.

Pero cuando me despierto,
algo caliente viene a mi lado.
Es un calentito muy bueno,
que me hace sentir bien.

Esto es lo que yo digo:
¡Caliente, caliente! ¡Yo no me gusto el frío!
Me llaman LLaMa, soy una llama.
Yo no me gusta el frío.

Other classic examples ("I love you, I love") seem OK at low temperatures. I have yet to try generating code or using Alpaca with it.
I'm unsure how to reliably evaluate the penalization schemes. I don't really expect a noticeable effect on perplexity, but I could try with repetitive text samples that incite the model to repeat itself at unexpected places.

My implementation extends the repetitions character by character, rather than token by token. It should be more efficient to work directly on the tokens. However, my concern is that the model may attempt to cheat using the morphological skills it acquired during training with subword sampling.


> There are numerical weights associated with each token in the tokenizer; it may be useful to use those in this calculation somehow.

@j-f1 Yes, it would be nice to reuse the frequencies from BPE. I don't have access to them in my working branch, as I'm still using #66 with the old model files.

This should help reduce the penalization of punctuation tokens. Currently, it is not that bad, since punctuation marks and stop words are often drawn from sharp (low-entropy) distributions on which the penalization has little influence.

@Piezoid
Contributor Author

Piezoid commented Mar 27, 2023

I've been trying to predict the token frequency from the tokenizer model's "scores".
Looking at the score distribution, it seems that -score is actually the rank of the token, with a few special cases:

  • Some tokens have a score == -1e9: These are mostly space tokens that shall not be merged ▁▁ ▁▁▁▁ ▁▁▁▁▁▁▁▁ ▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁ ▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁ ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▁
  • Some tokens have score == 0, mostly control and non-ascii bytes: <unk> <s> </s> <0x00> <0x01> <0x02> <0x03> [...] <0xFE> <0xFF>
  • Some tokens are missing (max(ranks) > len(ranks))
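
A small sketch of this reading of the scores, treating -score as a merge rank and flagging the special cases listed above (purely illustrative):

```cpp
#include <optional>

// Interpret a tokenizer score as a rank; the special cases get no rank.
std::optional<int> rank_from_score(float score) {
    if (score <= -1e9f) return std::nullopt; // unmergeable space runs
    if (score == 0.0f)  return std::nullopt; // <unk>, <s>, </s>, raw byte tokens
    return (int) (-score);                   // otherwise -score is the rank
}
```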

I've used the token frequency in wiki-text-2. The 50 most frequent tokens are:

▁the ▁, ▁. ▁of ▁and ▁in ▁to ▁a ▁= ▁" ▁@ ▁was ▁\' ▁The ▁as ▁that ▁on ▁for ▁with ▁by ▁) ▁( ▁is ▁from ed ▁at ing ▁his ▁were ▁it ▁he ▁an ▁In ▁had ▁which ▁be ▁are ▁; ▁not ▁their ▁but ▁A es ▁first ▁– ▁also ▁its ▁or ▁: ers

However, the tokenizer's top50 tokens are (according to tokenizer score):

▁t er in ▁a en on ▁th es ▁s ▁d at or an ▁c is re it ▁the ar le ▁w ▁p ou al ▁f ▁m ed ▁o ▁b om ion ing ic as el ent ▁in ▁h nd et ▁l ▁n st ▁to ch ▁I ro il ▁of de

I'm not sure what is going on. The two rank distributions are only slightly correlated, as shown by plotting the wiki-text ranks vs. the tokenizer's ranks:
[plot: wiki-text ranks vs. tokenizer ranks]

The wiki-text frequencies and ranks are a good fit for Zipf's law, using a maximum-likelihood method:
[plot: Zipf's law fit of wiki-text token frequencies]

As expected, predicting the frequencies from the tokenizer scores doesn't work as well:
[plots: predicted rank probabilities and residuals]

Notebook for the analysis
Resulting equation:

$$ p(x) \simeq \frac{(-\text{score}(x))^{-0.837}}{27.7} $$

The accuracy might be enough to limit the penalization applied to the most frequent tokens, but there is likely something wrong. I would really appreciate it if someone could let me know whether I overlooked something.
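
For reference, a one-liner for the fitted relation above, which could be used to damp the penalty for tokens predicted to be frequent. The exponent and constant come from the fit; everything else is an assumption, and it only makes sense for ordinary tokens with a strictly negative score (none of the special cases listed earlier):

```cpp
#include <cmath>

// p(x) ≈ (-score(x))^(-0.837) / 27.7, from the Zipf-based fit above.
float predicted_token_prob(float score) {
    return std::pow(-score, -0.837f) / 27.7f;
}
```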

@Piezoid
Contributor Author

Piezoid commented Mar 27, 2023

I have a prototype on this branch for the decaying repetition penalty weighted by repeat length.
By default, this should generate the same results as the master branch. The exponential decay is enabled by replacing the --repeat_last_n option with the new --repeat_half_life option; they are mutually exclusive.

For example, --repeat_half_life 16 implies that:

  • Repeating the last token will incur the full --repeat_penalty penalty;
  • A 16-token-old, 1-token-long repetition will be half-penalized;
  • A 32-token-old, 1-token-long repetition will receive a quarter of the penalization;
  • A 32-token-old, 2-token-long repetition will be half-penalized; and so on.

I've found that this new penalization heuristic helps when sampling at low temperatures. I recommend increasing --repeat_penalty a bit (1.2-1.4). Also, because it doesn't account for token frequencies, an increased penalty may cause issues with punctuation, stop words, and newlines, especially when generating code.
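
A compact sketch of the half-life weighting described above, with d the age of the repetition in tokens and l its length; the resulting weight then scales the blend between 1.0 and --repeat_penalty as in the earlier sketch (the prototype branch may differ in details):

```cpp
#include <cmath>

// Weight in (0, 1] that halves every (half_life * l) tokens of distance:
// d = 16, l = 1 gives 0.5; d = 32, l = 1 gives 0.25; d = 32, l = 2 gives 0.5.
float half_life_weight(int d, int l, int half_life) {
    return std::exp2(-(float) d / (float) (half_life * l));
}
```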

@anzz1
Contributor

anzz1 commented Mar 27, 2023

@Piezoid

Thanks for pointing me here from the other discussion. I'll be checking your branch out and testing it. I'm also rooting for you to finish the trace tool at some point; I can see it being highly valuable, especially for larger-scale testing and graphing. Even for smaller cases like debugging and general interest, it's a cool feature to be able to see how and why the decisions were made, which tokens are what, etc.

I had the idea some time ago to have a command-line option to output that stuff to the console, maybe redirecting stderr, but tbh your idea is a lot better. Using a binary format saves space if someone wants to do something crazy like a 7-day run, and it can easily be turned into a log file, graph, or whatever by another tool. Great idea altogether 💡

A disclaimer though: don't expect too much 😄

  1. My CPU is too low-powered to do any proper quantitative analysis, like using the perplexity tool or making hundreds of runs with the "Trace model outputs to a binary file" #477 tracer to get some output. I just haven't had the need to upgrade, but I'm thinking of upgrading soon given my newfound interest in this, as there are a lot of things I would like to do but simply can't right now due to lacking hardware (i5-6600k 4c/4t lul 🚀).

  2. To be perfectly honest, I don't really know wtf I'm doing half the time, and while I pretty much understand the concepts at play here regarding token selection and probabilities, I cannot really visualize which parameter affects what: the logic chain of "when you tweak this value, it causes this and that to change in this and that way, and this is why".

  3. I'm kinda just shooting blind, testing various models, tweaking stuff randomly, and seeing what comes out. So all my research so far is very anecdotal and subjective, and maybe not super valuable, but that's not to say it doesn't have any value. Things like creativity, behaviour, or the quality of a story are pretty hard to assess quantitatively anyway.

@LostRuins
Collaborator

Just throwing 2c at past implementations:

  1. OpenAI uses 2 variables for this: a presence penalty and a frequency penalty. The current implementation of rep pen in llama.cpp is equivalent to a presence penalty; adding an additional penalty based on the frequency of tokens in the penalty window might be worth exploring too (see the sketch after this list).
  2. KoboldAI instead uses a group of 3 values: what we call "Repetition Penalty", a "Repetition Penalty Slope", and a "Repetition Penalty Range". The repetition penalty is applied as a sigmoid interpolation between the Repetition Penalty value (at the most recent token) and 1.0 (at the end of the Repetition Penalty Range). The defaults we use are 1.1 rep pen, 1024 range, and 0.7 slope, which our community agrees provides relatively decent results across most models.
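
A sketch of the OpenAI-style combination from point 1, following the documented form in which both penalties are subtracted from the logits: a flat presence term applied once per distinct token in the window, plus a frequency term scaled by the token's count (names and types are illustrative, not llama.cpp code):

```cpp
#include <unordered_map>
#include <vector>

using llama_token = int;

void apply_presence_frequency_penalty(
        std::vector<float> & logits,             // one logit per vocab id
        const std::vector<llama_token> & last_n, // penalty window
        float alpha_presence,
        float alpha_frequency) {
    std::unordered_map<llama_token, int> counts;
    for (const llama_token tok : last_n) counts[tok]++;

    for (const auto & [tok, c] : counts) {
        logits[tok] -= alpha_frequency * (float) c  // grows with occurrence count
                     + alpha_presence;              // flat, once per distinct token
    }
}
```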

@Piezoid
Contributor Author

Piezoid commented Mar 28, 2023

@LostRuins Thank you for the references.

Do you know how naturally recurring tokens (such as \n, , ▁the, ▁,, ▁., etc.) are handled by these samplers? It is unclear whether we should penalize punctuation and stop-word tokens in the same way as more specific tokens. The model is quite confident when predicting these tokens and doesn't seem to be too deterred by the penalty. However, I've noticed that high penalties force the model into writing longer sentences and using fewer newlines.

The decaying penalty in KoboldAI appears to be similar to the one in my prototype. The main difference is that I'm using an exponential decay instead of a sigmoid one. Also, the repeat distance/age is divided by the length of the repeated sequence of tokens.

@LostRuins
Collaborator

I think there is no special handling - all tokens are treated equally except for intentionally banned tokens.

@ralyodio

What are valid values for the repeat penalty?

@LostRuins
Collaborator

Anything between 1 and 2 is a good start.

@Anyeos

Anyeos commented Jun 29, 2023

I tested a lot of values for each parameter in practice. I noted that it is far from ready for continuous use in interactive mode, because sooner or later it gets stuck in a loop (i.e., always the same answer regardless of user input).
Increasing temp to near 2 or more starts outputting near-random words (I think I reached a limit).
Increasing the penalty to 2 or more is not usable either, because the model tries to be as different as possible and ends up saying anything but what relates to the prompt.
I was also testing the mirostat mode, and it ended up being useless because it makes the chat worse over time. mirostat 2 is worse than 1; I think 2 is still in early development.

So, some values that work at the beginning but become worse over time are:

--ctx_size 1024
--batch-size 2048
--temp 0.82
--top-k 30
--top-p 1.8
--tfs 2.0
--typical 1.0
--keep -1
--repeat-last-n 1024
--repeat-penalty 1.2
--no-penalize-nl
--mirostat 1
--mirostat-lr 0.5
--mirostat-ent 4.0

I just want to share my values so we can make comparisons to improve it.
What I expect is an interactive chat that is sustained over time as if it had just been started, while only "remembering" some past messages.
The actual result is a chat that becomes more and more monotone until it finally ends in some loop or repetitive answer.
I tested a lot of values for every parameter without success, always with the same result: repetitive in the end. And that happens within only 10 or 20 messages, so the end is not very far off either.

Thank you, everyone, for this great project. It's not perfect, but now I have a high-quality LLM AI thanks to you all, so I am thankful despite the failures.

WolframRavenwolf added a commit to WolframRavenwolf/simple-proxy-for-tavern that referenced this issue Jul 17, 2023
[Improving the repetition penalty · Issue #331 · ggerganov/llama.cpp](ggerganov/llama.cpp#331 (comment))
@Piezoid
Contributor Author

Piezoid commented Sep 14, 2023

Closing this as out of date. Several alternatives have helped with this issue since then: #1126, #2135, #2593.

@Piezoid Piezoid closed this as completed Sep 14, 2023