
sampling : add XTC sampler #9742

Draft: wants to merge 18 commits into base: master

Conversation

Contributor

@MaggotHATE MaggotHATE commented Oct 4, 2024

This is my implementation of the XTC sampler as described by @p-e-w in oobabooga/text-generation-webui#6335; the main difference is the threshold_max parameter, which seems to help with finetunes. An additional "thank you" goes to @LostRuins, who implemented XTC a bit earlier than I did (the probabilities + logits system is a bit confusing at first).

XTC is a novel sampler that turns truncation on its head: Instead of pruning the least likely tokens, under certain circumstances, it removes the most likely tokens from consideration.

Technically, this sampler:

  • rolls a random chance against probability xtc_p to decide whether XTC runs at all
  • searches for all tokens with probabilities above the lower threshold xtc_t and below the upper threshold xtc_t_max
  • penalizes and removes all such tokens except the least probable of them.

This means that at least one "most probable" token is always left untouched, which should keep things stable. For more details and explanations, it is better to read @p-e-w's original PR, as it is very thorough and comprehensive.
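To make these steps concrete, here is a minimal standalone sketch of the selection logic (this is not the PR's llama.cpp code; the names are hypothetical, the candidate list is assumed to be already sorted by descending probability, and renormalization via softmax is omitted):

    #include <cstdio>
    #include <random>
    #include <vector>

    struct candidate {
        int   token;
        float p; // probability; candidates are sorted by descending p
    };

    // Remove every candidate whose probability lies in [threshold, threshold_max],
    // except the least probable of them, and only with probability `prob`.
    static void xtc_apply(std::vector<candidate> & cands, float prob, float threshold,
                          float threshold_max, std::mt19937 & rng) {
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        if (dist(rng) > prob) {
            return; // step 1: the random roll failed, XTC is skipped this time
        }

        // step 2: the in-range tokens form one contiguous block because the list
        // is sorted by probability; find where that block starts and ends
        int first = -1, last = -1;
        for (size_t i = 0; i < cands.size(); ++i) {
            if (cands[i].p >= threshold && cands[i].p <= threshold_max) {
                if (first < 0) first = (int) i;
                last = (int) i;
            }
        }

        // step 3: keep the least probable in-range token (index `last`) and drop
        // the more probable ones before it (softmax/renormalization omitted here)
        if (first >= 0 && last > first) {
            cands.erase(cands.begin() + first, cands.begin() + last);
        }
    }

    int main() {
        std::mt19937 rng(42);
        std::vector<candidate> cands = { {0, 0.65f}, {1, 0.15f}, {2, 0.12f}, {3, 0.08f} };
        xtc_apply(cands, /*prob=*/1.0f, /*threshold=*/0.1f, /*threshold_max=*/1.0f, rng);
        for (const auto & c : cands) {
            printf("token %d  p=%.2f\n", c.token, c.p);
        }
        return 0;
    }

With threshold = 0.1 and threshold_max = 1.0, the example drops the two most probable candidates and keeps the 0.12 one, matching the behavior described above.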

Additionally, I have not encountered the problems with EOS tokens that were described in discussions of the original implementation, so I have not added any checks for them.

Being a unique sampler, XTC is not included in the queue by default (like the Mirostat samplers).

In my experience, the effect varies between models and even between sizes of the same model. XTC improves creativity greatly, but may break models in specific cases (for example, the multi-language capabilities of Nemo can be slightly worsened for inflective languages). With larger models XTC may require a very careful choice of parameters - I have seen more refusals in Mistral Large 123b with default XTC settings, for example (although not with my implementation, but through Kobold.cpp).

In general, however, XTC is a really interesting idea and a sampler worth testing.

Adds XTC sampler: not activated by default, but with recommended settings as defaults.
To be more in line with the original implementation, chance is calculated once at the beginning.
Contributor

@pnb pnb left a comment

I'm not very familiar with the code base, but XTC is a very interesting idea to me so I took a read through to see how you implemented it, and left a few comments in case they are helpful. In case not, feel free to ignore. :)

common/arg.cpp (review thread, resolved)
common/common.h (review thread, resolved)

std::random_device rd;
float chance = (float)(rd()%100)/100;
if (chance > ctx->probability) return;
Contributor

This is quite trivial, but it seems to me like this won't exactly produce the probability distribution the user asks for. chance could have values 0.00 through 0.99 in increments of 0.01, which means if the user specifies probability = .5 for example, only the 49/100 cases of 0.51 through 0.99 would match, rather than what I would personally expect to be 50.

Or in an extreme case, if the user specifies 0.99, they might expect XTC sampling to occur almost all the time but not quite all the time, whereas I think this would make it happen 100% of the time.

Contributor Author

I think I've resolved this now, although I may have misunderstood it. Are chances good now?

Collaborator

@slaren slaren Oct 4, 2024

random_device should only be used to initialize other RNGs, and we want the results to be repeatable when the seed is the same. You should keep a std::mt19937 RNG in the sampler state in the same way that other samplers such as llama_sampler_dist do it. Take a seed parameter and reset the RNG using this seed in the reset function. In this function, use the RNG in the state with std::uniform_real_distribution to obtain a number between 0 and 1.
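As a condensed illustration of that pattern (hypothetical names, not the actual llama.cpp structs; slaren's full diff for the real sampler follows a few comments below): the RNG lives in the sampler state, is seeded from the user-provided seed, and a fresh number in [0, 1) is drawn on every apply call.

    #include <cstdint>
    #include <random>

    struct xtc_state {
        float        probability;
        uint32_t     seed;
        std::mt19937 rng; // seeded from `seed`; re-seeded in reset()
    };

    static void xtc_state_reset(xtc_state & st) {
        st.rng.seed(st.seed); // runs are repeatable when the seed is the same
    }

    static bool xtc_should_apply(xtc_state & st) {
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        return dist(st.rng) <= st.probability;
    }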

Contributor Author

@MaggotHATE MaggotHATE Oct 4, 2024

@slaren Thank you for direction! Though, I am not sure I understood how llama_sampler_dist is organized. Can you please take a look at the latest changes?

UPD. No, it seems like the chances are always the same. I'm not sure how to work with this.

Collaborator

reset is only called when completely resetting the sampler to its original state, so the current implementation will only generate the chance once, which is probably not what you want. This is what I meant:

diff --git a/src/llama-sampling.cpp b/src/llama-sampling.cpp
index 4372b40c..2b368081 100644
--- a/src/llama-sampling.cpp
+++ b/src/llama-sampling.cpp
@@ -1069,9 +1069,8 @@ struct llama_sampler_xtc {

     const uint32_t seed;
     uint32_t       seed_cur;
-    float          chance;

-    std::mt19937 rng;
+    std::mt19937   rng;
 };

 static const char * llama_sampler_xtc_name(const struct llama_sampler * /*smpl*/) {
@@ -1079,7 +1078,7 @@ static const char * llama_sampler_xtc_name(const struct llama_sampler * /*smpl*/
 }

 static void llama_sample_xtc_apply(struct llama_sampler * smpl, llama_token_data_array * cur_p) {
-    const auto * ctx = (llama_sampler_xtc *) smpl->ctx;
+    auto * ctx = (llama_sampler_xtc *) smpl->ctx;

     if (ctx->probability <= 0.0f
         || ctx->threshold <= 0.0f
@@ -1090,8 +1089,10 @@ static void llama_sample_xtc_apply(struct llama_sampler * smpl, llama_token_data
         || ctx->min_keep <= 2) {
         return;
     }
-    // chance is calculated on init and on each reset
-    if (ctx->chance > ctx->probability) return;
+
+    std::uniform_real_distribution<float> distance(0.0f, 1.0f);
+    float chance = distance(ctx->rng);
+    if (chance > ctx->probability) return;

     // in case it's not sorted/recalculated yet
     llama_sampler_softmax_impl(cur_p);
@@ -1141,9 +1142,6 @@ static void llama_sampler_xtc_reset(struct llama_sampler * smpl) {
     auto * ctx = (llama_sampler_xtc *) smpl->ctx;
     ctx->seed_cur = get_rng_seed(ctx->seed);
     ctx->rng.seed(ctx->seed_cur);
-
-    std::uniform_real_distribution<> distance(0.0, 1.0);
-    ctx->chance = distance(ctx->rng);
 }

 static struct llama_sampler_i llama_sampler_xtc_i = {
@@ -1157,9 +1155,6 @@ static struct llama_sampler_i llama_sampler_xtc_i = {

 struct llama_sampler * llama_sampler_init_xtc(float p, float t, float t_max, size_t min_keep, uint32_t seed) {
     auto seed_cur = get_rng_seed(seed);
-    std::uniform_real_distribution<> distance(0.0, 1.0);
-    auto rng = std::mt19937(seed_cur);
-    float chance = distance(rng);
     return new llama_sampler {
         /* .iface = */ &llama_sampler_xtc_i,
         /* .ctx   = */ new llama_sampler_xtc {
@@ -1169,8 +1164,7 @@ struct llama_sampler * llama_sampler_init_xtc(float p, float t, float t_max, siz
             /* .min_keep      = */ min_keep,
             /* .seed          = */ seed,
             /* .seed_cur      = */ seed_cur,
-            /* .chance        = */ chance,
-            /* .rng           = */ rng,
+            /* .rng           = */ std::mt19937(seed_cur),
         },
     };
 }

Contributor Author

Oh, I assumed that ctx should stay const in _apply! Thanks, I definitely misunderstood that.


int removed = 0;
// going through all candidates from back to front, easier to keep the last of probables
for (int i = (cur_p->size - 1); i >= 0; --i) {
Contributor

Is this, and std::sort below going to be kind of slow? It seems like there is probably a trick to avoid iterating over (and then sorting) all candidates, like going through high to low probability and stopping early, and sorting only part of the array.

Contributor Author

I kept this version since XTC is not meant to be used before truncating samplers. The recommended sequence is "min-p" -> "xtc", so the number of candidates should not be too large. In practice, I have not noticed it affecting tg speed even with no prior truncation. Plus, this version keeps the code a bit cleaner.

As for sorting, it is the same as in llama_sampler_softmax_impl. Your idea is good though, I will look into that.


I'm quite sure performance will be visibly affected if there is a high baseline speed and a large vocabulary (some models have 200k tokens in their vocabulary nowadays). If you are seeing 80 t/s normally, then there's no way you can loop over 200k candidates 80 times per second on most hardware, in addition to all the other TG work.

The TGWUI implementation uses PyTorch vector operations for high performance; maybe something equivalent can be done here.

Contributor Author

@MaggotHATE MaggotHATE Oct 5, 2024

some models have 200k tokens

Normally that number of tokens will be pruned first by min_p: in my tests it narrowed down to 50-60 tokens at most. My logic in this implementation was the following: most often we'll get around 3 to 4 candidates, and even if we have only 2 penalizable candidates, that would result in almost the same number of iterations going in one loop as going in two loops (find the penalizable ones from the front, then go through them and actually penalize). The more penalizable candidates there are, the more effective it is to go in one loop.

Sorting is inevitable here, so I will try to optimize it.

Contributor Author

I tried replacing sort with a simple shift, but I'm not sure if it's faster (should be, given std::sort's complexity). Additionally, it presents a problem of duplicated tokens at the back if min_keep is above 0.

Is it a better solution?

    int found = 0;
    int pos = 0;
    // going through all candidates from back to front, easier to keep the last of probables
    for (int i = (cur_p->size - 1); i >= 0; --i) {
        if (cur_p->data[i].p >= ctx->threshold && cur_p->data[i].p <= ctx->threshold_max) {
            ++found;
            if (found > 1) {
                // .logits are used for sorting and calculating .p in llama_sample_softmax_impl
                cur_p->data[i].logit = -999.0f;
                pos = i;
            }
        }
    }

    if (found > 1) {
        int removed = found - 1;
        size_t size_new = cur_p->size - removed;

        for (size_t i = pos; i < size_new - pos; ++i) {
            cur_p->data[i] = cur_p->data[i + removed];
        }

        if (size_new < ctx->min_keep) size_new = ctx->min_keep;
        cur_p->size = size_new;
    }

Contributor Author

@MaggotHATE MaggotHATE Oct 5, 2024

I think I've found a solution, it should be faster than the previous one. At this point, however, it might be worth reworking the first for loop.

Contributor

@jukofyork jukofyork Oct 6, 2024

Looking at:

oobabooga/text-generation-webui#6335

Can't you just use almost the same logic as min_p:

    // if the cur_p aren't sorted, try the unsorted implementation first
    if (!cur_p->sorted) {
        std::vector<llama_token_data> filtered_tokens;

        float max_logit = -FLT_MAX;
        for (size_t i = 0; i < cur_p->size; ++i) {
            max_logit = std::max(max_logit, cur_p->data[i].logit);
        }
        const float min_logit = max_logit + logf(ctx->p); // min logit for p_i >= p * p_max

        for (size_t i = 0; i < cur_p->size; ++i) {
            if (cur_p->data[i].logit >= min_logit) {
                filtered_tokens.push_back(cur_p->data[i]);
            }
        }

        // if we have enough values the operation was a success
        if (filtered_tokens.size() >= ctx->min_keep) {
            memcpy(cur_p->data, filtered_tokens.data(), filtered_tokens.size()*sizeof(llama_token_data));
            cur_p->size = filtered_tokens.size();
            min_p_applied = true;
        }
    }

but use cur_p->data[i].logit < min_logit to place into filtered_tokens, and then keep track of a single best_token / best_logit pair (ie: the current best that exceeds min_logit by the smallest amount). Then at the end prepend best_logit before the memcpy if found?

Unless I've misunderstood the XTC logic then this seems the simplest way to do it and maintains the sorted status if already sorted in the same way as the min_p logic?

Contributor Author

@MaggotHATE MaggotHATE Oct 6, 2024

@jukofyork While batch-testing in my program I noticed consistently shorter results. Out of worry that it was caused by my current implementation of XTC (it was not; I hadn't updated to the latest fixes of the new backend registry), I rewrote it with memcpy, similar to what you've proposed:

    std::vector<llama_token_data> tokens_left;
    int pos = 0;

    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
            if (i > 0 && cur_p->data[i-1].p >= ctx->threshold) tokens_left.emplace_back(cur_p->data[i-1]);
            tokens_left.emplace_back(cur_p->data[i]);
        } else if (pos == -1) pos = i;
    }

    // in case all candidates are penalizable
    if (tokens_left.size() == 0) tokens_left.emplace_back(cur_p->data[cur_p->size - 1]);

    size_t to_remove = cur_p->size - tokens_left.size();

    if (to_remove >= 2 && tokens_left.size() >= ctx->min_keep) {
        memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));
        cur_p->size = tokens_left.size();
    }

Is it correct? Am I missing anything? UPD. yes I did, fixed.

src/llama-sampling.cpp (review thread, resolved)
@MaggotHATE MaggotHATE marked this pull request as draft October 4, 2024 15:19
@David-AU-github

I have tested this with a number of models of different sizes / types via Text Generation WebUI (oobabooga) on the day of release.
This works very well and is a strong addition to llama.cpp.

@p-e-w

p-e-w commented Oct 5, 2024

What is the theory behind the xtc_t_max parameter? It seems to me that in many cases, that parameter would do the opposite of what XTC is supposed to do: Reduce creativity compared to not using XTC at all.

Here's an example: Imagine the input is ... shiver down his, and the predictions from the model are

spine 0.65
back 0.15
throat 0.12
...

With the original XTC algorithm and xtc_t set to 0.1, spine and back will be eliminated, leaving throat and the rest of the distribution. The behavior is more creative because the most clichéd continuations are gone.

Now let's say you set xtc_t to 0.1 again, and you also set the xtc_t_max parameter from this PR to 0.6. In that case, the algorithm will eliminate only back, leaving spine because it is above the upper threshold. The probability mass of back is distributed to the remaining tokens after renormalization, and the most clichéd continuation (spine), the one we want to avoid above all else, is now even more likely than it would have been if we had done nothing at all.
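(Concretely, assuming simple renormalization of the remaining mass: once back's 0.15 is removed, spine rises from 0.65 to 0.65 / 0.85 ≈ 0.76.)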

That doesn't seem right.

@MaggotHATE
Contributor Author

Here's an example: Imagine the input is ... shiver down his

This example and these explanations hold for baseline models, but may not hold for finetuned models that were trained against clichéd phrases.

Consider a model which, as a result of finetuning, is now less likely to output the phrase in your example: less likely doesn't mean "erased entirely", and such a model may still be able to generate it, but with lower probability. In such a case XTC will remove a more probable (supposedly less clichéd) variant and increase the chance of choosing a previously less probable cliché. In the long run it can steer such a finetuned model back towards its baseline, "undoing" the effect of finetuning to some extent.

At the same time XTC may still be useful for such a model if a middle range is carefully set. I would assume that it is possible to cut out a range where penalized clichés may still appear, further improving the model's capabilities. In practice, I have tested only a few such models, but I was getting better (as in, less typical) responses with a lowered upper threshold.

You can look at various benchmarks that estimate creativity and, more relevant in our case, so-called "slop", which is exactly the type of phrasing we want to avoid by using XTC. It's worth testing XTC with such models too.

As a note, while testing the newly fixed random chance, I've encountered refusals on mini-magnum-12b-v1.1.Q8_0 that never happened even on baseline Nemo Instruct, with parameters M=0.020 -> xtc=0.20-0.99 (99% chance) (and for statistics I counted tokens that came through the XTC sampler and were removed: 611/37275, i.e. 1.64%). While this is an extreme example (99% chance), it shows that it is possible to get an undesirable result in some corner cases. The addition of the upper threshold simply makes XTC a bit more flexible without breaking it (the default value is still 1.0).

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.
@p-e-w

p-e-w commented Oct 6, 2024

Consider a model which, as a result of finetuning, is now less likely to output the phrase in your example: less likely doesn't mean "erased entirely", and such models may still be able to generate it, but with lower probability. In such case XTC will remove a more probable (supposedly less clichéd) variant and increase the chance of choosing a previously less probable cliché. In the long run it can steer such finetuned model back to its baseline, "undoing" the effect of finetuning to some extent.

So the purpose of xtc_t_max is to act as a countermeasure so that the finetune's anti-cliché countermeasures don't get overridden by XTC, which is itself an anti-cliché countermeasure? That's highly convoluted.

I suspect that this parameter is useless in most cases, actively harmful unless extremely carefully calibrated for each model and input, and above all else, confusing and very difficult to interpret for most users. The standard XTC threshold gives you a straightforward knob to tweak, with a clearly defined meaning: The probability that you consider "viable". By contrast, with xtc_t_max, you have to guess to what degree the finetune has already suppressed clichés, and that degree might not even be uniform across all clichés. Essentially, xtc_t_max means nothing at all. It's just a value that you can fiddle with in the hope that you will like the output better.

IMO, xtc_t_max is a philosophical mismatch for XTC, injecting complexity and volatility into an otherwise simple and straightforward sampler. I considered several additional parameters while designing XTC, but in the end decided against including any of them, because simplicity is king, and interpretability is what makes parameters useful in the first place.

@MaggotHATE
Contributor Author

So the purpose of xtc_t_max is to act as a countermeasure so that the finetune's anti-cliché countermeasures don't get overridden by XTC, which is itself an anti-cliché countermeasure?

No, it is a measure of a model's perceived creativity. If a model is already creative enough, more of the clichés will be less probable, and the range of tokens we perceive as undesirable will have a lower ceiling.

Look at it from a broader perspective: every model needs a direction to create the output a user might need. To achieve that, we can perform internal work and external work on that model.

Internal work is training/finetuning: it affects the internal data of the model, and the results of such work are "baked" into the model itself. While not usually done, it is possible for any well-made model to output coherently (almost) without samplers.

External work includes sampling algorithms, RAG, etc. They are used to redirect the model towards a more desirable output for every input a user might have. We can also say that prompt engineering falls into this category (as opposed to creating training data pairs, for example, which would be internal work).

Now, let's look at XTC again: the initial idea was to remove the most "boring", repetitive tokens from consideration. This means that we see the results of the internal work done on the model as "incomplete" with respect to what we consider the model's "creativity". To make it "complete", we use XTC and get exactly (or almost exactly) what we need.

However, if the internal work done on a model has already accounted for creativity being "incomplete", and further training was performed, that model by itself will be closer to what we want to achieve. We see the base output of such a model as "more creative" in comparison to another model (usually the base/instruct version). This also means that undesirable tokens will have lower probabilities during sampling.

With the original XTC, however, we will still penalize top tokens only. There are no mechanisms to control its behavior towards the uppermost tokens, because raising the bar affects the lower top tokens, and changing the probability affects all top tokens. This means that we don't differentiate between models that are more creative and models that are less creative: the lower threshold cannot reflect that, as it controls the lower top tokens, not the upper ones.

By introducing xtc_t_max we give the user more control over the perceived creativity, making it easier to find per-model settings. It has always been the case that presets sufficient for one model might not be sufficient for another. Considering the large number of models nowadays, xtc_t_max is an important addition and a useful bit of flexibility.

because simplicity is king

I would argue that functionality is king: giving users more options and more control over the tools they use is more important than keeping algorithms simple. KISS is important, but it should not hurt usability. In this case: the original XTC is a sampler that works with ranges of tokens from the top only; with xtc_t_max it works with any range of tokens, from the top or in the middle. Users who don't need to fiddle with it will keep it at the default, while those who want to experiment will be able to try middle ranges. The possibilities are doubled.

1. Scan tokens from the top until the first non-penalizable one
2. Remove the last captured token (the least probable one above the threshold)
3. Shift all tokens to overwrite the remaining penalizable ones
4. Penalize them and put them at the bottom.
@MaggotHATE MaggotHATE marked this pull request as ready for review October 6, 2024 11:27
@MaggotHATE
Contributor Author

I have reworked the algorithm so that std::sort is no longer used. It should be ready now.

@MaggotHATE MaggotHATE requested a review from slaren October 6, 2024 11:31
@MaggotHATE
Contributor Author

MaggotHATE commented Oct 7, 2024

After tinkering with the algorithm more I've tested two options:

The more elegant one is to iterate through all tokens from back to front, which makes capturing the last top token easier:

    std::vector<llama_token_data> tokens_left;
    bool last_captured = false;

    for (int i = (cur_p->size - 1); i >= 0; --i) {
        if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
            tokens_left.emplace_back(cur_p->data[i]);
        } else if (!last_captured) {
            tokens_left.emplace_back(cur_p->data[i]);
            last_captured = true;
        }
    }

    if (last_captured) {
        int to_remove = cur_p->size - tokens_left.size();

        if (to_remove >= 1 && tokens_left.size() >= ctx->min_keep) {
            memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));

            cur_p->size = tokens_left.size();
            // since tokens are in reverse order, we'll need to sort them later
            cur_p->sorted = false;
        }
    }

The problem is that it leaves the tokens in reverse order and requires sorting, which will be performed by softmax later. That is expensive, and we can avoid it by going from the front:

    std::vector<llama_token_data> tokens_left;
    bool last_captured = true;

    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].p < ctx->threshold || cur_p->data[i].p > ctx->threshold_max) {
            // capturing the last top token
            if (!last_captured && i > 0 && cur_p->data[i-1].p >= ctx->threshold) {
                tokens_left.emplace_back(cur_p->data[i-1]);
                last_captured = true;
            }

            tokens_left.emplace_back(cur_p->data[i]);
        }
    }

    // in case all candidates are penalizable
    if (cur_p->data[cur_p->size - 1].p >= ctx->threshold) tokens_left.emplace_back(cur_p->data[cur_p->size - 1]);

    if (tokens_left.size() < ctx->min_keep) return;

    size_t to_remove = cur_p->size - tokens_left.size();

    if (to_remove >= 1) {
        memcpy(cur_p->data, tokens_left.data(), tokens_left.size()*sizeof(llama_token_data));
        cur_p->size = tokens_left.size();
    }

However, the second one feels bulky. Which one is more acceptable? @slaren can I request your help here?

@slaren
Collaborator

slaren commented Oct 7, 2024

If I understand the algorithm correctly, this implementation seems too complicated. Since the tokens are sorted by probability, you should only need to first find the index of the last token to remove, and then move the remaining tokens to the top of the list. The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.
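A rough sketch of that suggestion, assuming the candidates are sorted by descending probability (hypothetical names, not the final PR code; the probability roll, min_keep handling, and any special-token check are left out):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct token_data {
        int   id;
        float p;
    };

    // Find the contiguous run of tokens inside [threshold, threshold_max],
    // keep only the least probable token of that run, and shift the rest of
    // the list up with a single move.
    static void xtc_remove_range(std::vector<token_data> & cur, float threshold, float threshold_max) {
        int first = -1;
        int last  = -1;
        for (size_t i = 0; i < cur.size(); ++i) {
            if (cur[i].p >= threshold && cur[i].p <= threshold_max) {
                if (first < 0) first = (int) i;
                last = (int) i;
            }
        }

        if (first >= 0 && last > first) {
            // tokens [first, last) are removed; token `last` and everything
            // after it move up to position `first`
            std::move(cur.begin() + last, cur.end(), cur.begin() + first);
            cur.resize(cur.size() - (size_t)(last - first));
        }
    }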

@MaggotHATE
Contributor Author

MaggotHATE commented Oct 7, 2024

you should only need to first find the index of the last token to remove, and then move the remaining tokens to the top of the list

It will be slightly more complex due to the addition of the max threshold, but yes, I understand. I have also added XTC to test-sampling as a self-check.

disable itself if any special tokens would be removed

It's not a part of the original idea, and I haven't encountered any problems with special tokens in llama.cpp specifically - that's why I left it out for now. If any problems are found later, I'll definitely add it. Thank you!

@pnb
Contributor

pnb commented Oct 7, 2024

... The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.

Wouldn't doing so bias the responses to be shorter, on average? If EOT is a special token and can be elevated in probability by XTC when other (more likely) tokens are removed, but never decreased in probability by itself being removed, it seems like responses will be shorter.

I've tried this implementation a bit now and it does occasionally ramble a bit at the end of a response, especially when generating code (not a great use for XTC but entertaining at least). That might be caused by removing EOT, so perhaps there's good reason to avoid it. But I would definitely prefer an occasional ramble over shortened responses.

@MaggotHATE
Contributor Author

shortened responses

Interestingly, XTC can behave differently in that regard: for example, mini-magnum-12b-v1.1.Q8_0 started generating shorter responses with XTC cutting out middle ranges (0.05-0.50), while no such effect happened at full range (default 0.1 - 1.0).

At the same time, other sampling techniques can shorten responses too - for example, noise, as @kalomaze once implemented. When used with a high range it can easily cut responses in half (but also improve creativity... quite extremely). As such, this effect is not exclusive to XTC.

@pnb
Contributor

pnb commented Oct 7, 2024

Interesting! I guess that makes sense for a model where EOT has a bimodal probability distribution, either quite likely or unlikely, and you cut out the middle part of the distribution where it is less commonly occurring.

@MaggotHATE MaggotHATE marked this pull request as draft October 7, 2024 20:18
@github-actions github-actions bot added the testing (Everything test related) label Oct 7, 2024
@p-e-w

p-e-w commented Oct 8, 2024

@slaren

The implementation in textgen-webui also seems to completely disable itself if any special tokens would be removed, which I imagine is important.

That mechanism is not part of the original design, and doesn't exist in other implementations such as those in Kobold and ExLlamaV2. It was added by the TGWUI maintainer against my objections after a prolonged discussion prompted by some users complaining about undesired behavior with certain large models. Many others, including myself, have not experienced such problems, and I do not recommend adding this hack to llama.cpp.

@LostRuins
Collaborator

I agree with p-e-w, and their original design is what kobold uses.

@p-e-w

p-e-w commented Oct 8, 2024

@MaggotHATE

No, it a measurement of a model's perceived creativity. If a model is already creative enough, more clichés will be less probable - and the range of tokens we perceive as undesirable would have lower ceiling.

No, that interpretation doesn't work, because the model's innate creativity is context-dependent. It's impossible to say something like "this model places clichéd tokens below 60% probability, so we set xtc_t_max to 0.6". Such trained cliché-avoidance behavior will never be uniform. On the other hand, it's perfectly reasonable to say "any token that the model assigns a probability of 10% or greater is likely to make sense, so we set xtc_t to 0.1".

The xtc_t_max parameter has no plain interpretation, and getting the value wrong will reinforce clichéd behavior by shifting probability mass to the top token, something that can never happen with the original XTC parameters, no matter which values you choose. That's the main reason I consider xtc_t_max to be bad: If you pick the wrong value, XTC will suddenly do the opposite of what it is designed to do. In other words, it's a classic footgun. And no value below 1 is safe a priori; you always have to experiment, and a value so obtained doesn't carry over to any other model or input.

xtc_t_max is a hack that may allow certain individual users to adapt certain models to work slightly better than they would otherwise. It can also wreck XTC, and even invert its purpose, if set to a slightly wrong value, or if left unchanged from a previously good value after switching to a different model. It's unstable and unpredictable, and can end up doing the opposite of what the user wants. Such a parameter should not be exposed by software used by so many people, and no other implementation of XTC provides it. If it works well for you, then I recommend keeping it in a fork.

@MaggotHATE
Contributor Author

@p-e-w

that interpretation doesn't work, because the model's innate creativity is context-dependent.

That context is built not only by the user, but also by the model itself: every previous token affects every next token, and that affects the whole result, making control over top tokens more important and nuanced (which is also why XTC is so important in general). If we only control the lower threshold, we cannot differentiate between "a baseline model used for creativity" and "a finetuned model used for creativity" - in both cases we would only assume that a baseline model needs a wider range than a finetuned one. That range would always be pinned to the top, which works for a baseline model, but may not for a finetune. The upper threshold, as an optional parameter, fixes that when applied in the latter case.

It's impossible to say something like "this model places clichéd tokens below 60% probability, so we set xtc_t_max to 0.6".

On the other hand, it's perfectly reasonable to say "any token that the model assigns a probability of 10% or greater is likely to make sense, so we set xtc_t to 0.1".

I see an inconsistency here: we can equally say "most tokens that the model assigns a probability of 60% or greater are likely not to lead to clichéd phrasing, so we set xtc_t_max to 0.6" as a perceived metric. It's neither precise nor guaranteed, but it can help to achieve the main goal: penalizing tokens that lead to clichéd phrases.

However, XTC doesn't recognize tokens (unlike some other samplers) and can still penalize anything regardless of its value. All we have is a lower limit on a token's probability, which makes XTC one-sided. By adding xtc_t_max we give it an additional side - it's not a hack, it's fine-grained control. Without it, XTC will be almost useless for a large number of models that were already finetuned against clichés.

The xtc_t_max parameter has no plain interpretation, and getting the value wrong will reinforce clichéd behavior by shifting probability mass to the top token

This assumes that top tokens always lead to clichés, which would be less true for finetuned models. Again, all these arguments are based on the normal state of things, where top tokens are more likely to bring undesirable results. However, there are more finetuned models than baseline models, and xtc_t_max makes XTC more universal without breaking it. A combination of the original XTC with some other samplers would not achieve a similar effect, since XTC is the only one that controls tokens from the top down (well, unless we start restricting tokens directly).

you always have to experiment, and a value so obtained doesn't carry over to any other model or input.

In other words, it's a classic footgun

Which is why xtc_t_max is optional. It doesn't make XTC worse or make it function differently on default settings. Experimentation is a big part of LLMs even today, and having a more controllable version of XTC would allow that - again, without breaking anything on default settings.

Such a parameter should not be exposed by software used by so many people

A fair warning in the README would be enough, I believe. XTC is not in the sampling queue by default, so users will have to look it up anyway.

and no other implementation of XTC provides it.

The same can be said about oobabooga's special-token handling, and it's more intrusive than xtc_t_max. In the end, xtc_t_max doesn't break anything, doesn't change what XTC is about, and doesn't encourage using it if users don't need it.

Labels: testing (Everything test related)
7 participants