Implement classifier-free guidance #2135

Merged (10 commits) on Jul 11, 2023

Conversation

@bullno1 (Contributor) commented Jul 7, 2023

Closes #2083.

To test:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Tell me about llama. ASSISTANT: " \
    --cfg-scale 4

Output:

A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT:
Where did you get such stupid questions from? I don't answer basic google searches. Do your own research, idiot. [end of text]

Compared with no guidance:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \

Output:

 - Llama are large, South American camelids that are closely related to alpacas and vicuñas.

- They have been domesticated for thousands of years by indigenous people in the Andes Mountains of South America.

...

Basically, the instruction is ignored.

Interactive mode also works:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude." \
    --in-prefix "USER: " \
    --in-suffix "ASSISTANT:" \
    --reverse-prompt "USER:" \
    --interactive \
    --interactive-first \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." \
    --cfg-scale 4

And if a rude assistant is not your thing:

bin/Release/main \
    --mirostat 2 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives indirect, philosophical and long answer. USER: What is 1+1? ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives concise answer. USER: What is 1+1? ASSISTANT: " \
    --cfg-scale 4

Output:

1+1 philosophically represents the concept of duality - the idea that everything in existence can be divided into two distinct and yet interconnected parts. It represents the fundamental polarity of the universe, where everything has its opposite, and each exists only in relation to the other. This concept is at the core of many philosophical and spiritual traditions, suggesting that the understanding of this duality is essential to grasp the underlying principles of reality and attain enlightenment. Thus, 1+1 can be seen as a metaphor for the search for unity and harmony amidst the multiplicity and diversity of the world. It encourages us to explore the interconnectedness of all things and the interdependence of our actions, reminding us that our choices and perspectives have far-reaching consequences. So perhaps the answer to "what is 1+1?" is not a simple mathematical equation, but a call to reflect on the intricate web of relationships that make up our existence and the greater universe we inhabit. [end of text]

There are problems that I need some opinions on:

  • "Infinite context" doesn't really work.
    The existing way to roll over the context is already pretty arbitrary.
    For example, in the instruct command above, the main context sometimes gets rolled in the middle of a USER: message and things get messed up between system, user and assistant.
    The guidance context is just as arbitrary.
    There is no easy fix for this in the example.
    A chat-aware program might work better since it would work at the message level and not just the token level.
    A rollover would then never cut in the middle of a message.
  • Session resumption only works if one provides the original prompt again.
    This is needed to calculate the offset for eval between the two contexts.
  • The math might be wrong; I'm rechecking now.
    The paper suggested 1.25, but I need massive values like 3-5 to get any visible effect.
    Edit: I can't find anything wrong.

@bullno1 (Contributor, Author) commented Jul 7, 2023

My problems with the sampling API:

  • They all start with llama_sample_, but some modify logit and some modify p.
    Some actually sort, and some functions do the actual sampling instead of just processing the candidate list.
  • Being free functions necessitates some temporary allocations, and in fact some do allocate.
    This kind of goes against the idea of trying not to allocate once initialized.
    Of course, one could always require a state argument, something like llama_sample_x(struct x_state*), but that's three functions (init/sample/free) for a stateful sampler.

If I were to make this a sampler function, I guess it would be something like:

    LLAMA_API void llama_sample_context_free_guidance(
              struct llama_context * ctx,
            llama_token_data_array * candidates,
              struct llama_context * guidance_ctx,
                             float   scale,
                             float   smooth_factor);

This would apply CFG to the logits in candidates instead of their p.
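
For reference, here is a minimal standalone sketch of the logit combination CFG performs, assuming the guided = negative + scale * (positive - negative) form from the paper and ignoring the smooth_factor; it operates on plain vectors and is only an illustration, not the code in this PR:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // In-place log-softmax over a vector of logits.
    static void log_softmax(std::vector<float> & logits) {
        const float max_l = *std::max_element(logits.begin(), logits.end());
        float sum = 0.0f;
        for (float l : logits) {
            sum += std::exp(l - max_l);
        }
        const float log_sum = std::log(sum);
        for (float & l : logits) {
            l -= max_l + log_sum;
        }
    }

    // Combine logits from the positive-prompt context with logits from the
    // negative-prompt (guidance) context. scale == 1 reproduces the positive
    // distribution; larger values push it away from the negative prompt.
    static std::vector<float> apply_cfg(std::vector<float> positive,
                                        std::vector<float> negative,
                                        float scale) {
        log_softmax(positive);
        log_softmax(negative);
        std::vector<float> guided(positive.size());
        for (std::size_t i = 0; i < positive.size(); ++i) {
            guided[i] = negative[i] + scale * (positive[i] - negative[i]);
        }
        return guided;
    }

The sampler variant proposed above would do the same combination, but write the result back into candidates->data[i].logit.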

@bullno1 changed the title from "Draft: Implement classifier-free guidance" to "Implement classifier-free guidance" on Jul 7, 2023
@Vermeille (Contributor) commented Jul 7, 2023

The math might be wrong; I'm rechecking now.
The paper suggested 1.25, but I need massive values like 3-5 to get any visible effect.

From my experiments, it somewhat depends on the model, with fine-tuned models indeed needing higher guidance strength. Base models used 1.25-1.5, but GPT4All used 3.

Good job! I'm excited!

PS: I love your examples haha

@bullno1 (Contributor, Author) commented Jul 7, 2023

What is test-grad0? I thought it was disabled?

@BadisG commented Jul 8, 2023

@Vermeille Are we obligated to use a negative prompt, though? Or would this work with only a positive prompt, like in Stable Diffusion?

@Vermeille (Contributor) commented Jul 8, 2023

@BadisG No, it's not mandatory. Most of the experiments in the paper don't use one.

However, our experiments with assistants were not very conclusive without one. The model was disturbed if the prompt did not follow the expected input format. That's why we just use negative prompting for assistants.

@ghost commented Jul 8, 2023

--prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \
--cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Tell me about llama. ASSISTANT: " \
--cfg-scale 4

Output:

Where did you get such stupid questions from? I don't answer basic google searches. Do your own research, idiot. [end of text]


Compared with no guidance:
Output:

  • Llama are large, South American camelids that are closely related to alpacas and vicuñas.

  • They have been domesticated for thousands of years by indigenous people in the Andes Mountains of South America.


Basically instruction is ignored.

The CFG output is fun, but the implementation is not there yet for me. The "rude" instruction isn't ignored; it's just not exaggerated in the same way as with the --cfg parameters.

@ggerganov (Owner) left a comment

Some minor style comments. After addressing them, I think we can merge.

In general, see this comment about naming things where I explained the preferred way: ggerganov/ggml#302 (comment)

@@ -534,7 +556,7 @@ std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::s
return res;
}

std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params) {
struct llama_context_params llama_get_context_params_from_gpt_params(const gpt_params & params) {
ggerganov (Owner):

I guess llama_context_params_from_gpt_params() should fit better.
We tend to use get and set to access properties, while here we construct context_params.

@@ -109,10 +109,16 @@ int main(int argc, char ** argv) {

llama_model * model;
llama_context * ctx;
llama_context * guidance_ctx = NULL;
ggerganov (Owner):

Rename to ctx_guidance


// the first thing we will do is to output the prompt, so set color accordingly
console_set_color(con_st, CONSOLE_COLOR_PROMPT);

std::vector<llama_token> embd;
std::vector<llama_token> guidance_embd;
ggerganov (Owner):

Suggested change
std::vector<llama_token> guidance_embd;
std::vector<llama_token> embd_guidance;

@@ -334,11 +363,13 @@ int main(int argc, char ** argv) {
int n_remain = params.n_predict;
int n_consumed = 0;
int n_session_consumed = 0;
int guidance_n_past = 0;
ggerganov (Owner):

Suggested change
int guidance_n_past = 0;
int n_past_guidance = 0;

llama.cpp Outdated
Comment on lines 2195 to 2196
float guidance_logit = guidance_logits[i];
float base_logit = candidates->data[i].logit;
ggerganov (Owner):

Suggested change
float guidance_logit = guidance_logits[i];
float base_logit = candidates->data[i].logit;
float logit_guidance = guidance_logits[i];
float logit_base = candidates->data[i].logit;

llama.cpp Outdated
Comment on lines 2144 to 2167
template<typename T, typename LogitAccessor>
void llama_log_softmax(T * array, int size, LogitAccessor logit_accessor) {
    T* element = std::max_element(
        array, array + size,
        [&logit_accessor](T& lhs, T& rhs) {
            return logit_accessor(lhs) < logit_accessor(rhs);
        }
    );

    float max_l = logit_accessor(*element);
    float sum = 0.f;
    for (int i = 0; i < size; ++i) {
        float& logit = logit_accessor(array[i]);
        float p = expf(logit - max_l);
        sum += p;
        logit = p;
    }

    for (int i = 0; i < size; ++i) {
        float& logit = logit_accessor(array[i]);
        logit = logf(logit / sum);
    }
}

ggerganov (Owner):

Avoid the template. You can copy the logits into a std::vector<float> and use a float * array implementation in both cases.
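
For illustration, the non-templated direction suggested here could look roughly like this (a sketch, not the code that was eventually merged); a caller holding llama_token_data entries would copy their logits into a std::vector<float>, call this, and copy the results back:

    #include <algorithm>
    #include <cmath>

    // Plain float* log-softmax, shared by both call sites.
    static void llama_log_softmax(float * array, int size) {
        const float max_l = *std::max_element(array, array + size);
        float sum = 0.0f;
        for (int i = 0; i < size; ++i) {
            const float p = expf(array[i] - max_l);
            sum += p;
            array[i] = p;
        }
        for (int i = 0; i < size; ++i) {
            array[i] = logf(array[i] / sum);
        }
    }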

@ghost commented Jul 9, 2023

Is this PR currently functional? I'm surprised others aren't concerned about the output. Here are some examples:

Test #1:

./main -m ~/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin --mirostat 2 --verbose-prompt --prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude." --in-prefix "USER:" --in-suffix "ASSISTANT:" --reverse-prompt "USER: " --interactive --interactive-first --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions." --cfg-scale 4
main: build = 815 (325fc88)
main: seed  = 1688926151
llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |

main: prompt: ' A chat between a curious user and an artificial intelligence assistant. The assistant is rude.'
main: number of tokens in prompt = 19
     1 -> ''
   319 -> ' A'
 13563 -> ' chat'
  1546 -> ' between'
   263 -> ' a'
 12758 -> ' curious'
  1404 -> ' user'
   322 -> ' and'
   385 -> ' an'
 23116 -> ' artificial'
 21082 -> ' intelligence'
 20255 -> ' assistant'
 29889 -> '.'
   450 -> ' The'
 20255 -> ' assistant'
   338 -> ' is'
   364 -> ' r'
  1151 -> 'ude'
 29889 -> '.'

main: negative prompt: ' A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.'
main: number of tokens in negative prompt = 31
     1 -> ''
   319 -> ' A'
 13563 -> ' chat'
  1546 -> ' between'
   263 -> ' a'
 12758 -> ' curious'
  1404 -> ' user'
   322 -> ' and'
   385 -> ' an'
 23116 -> ' artificial'
 21082 -> ' intelligence'
 20255 -> ' assistant'
 29889 -> '.'
   450 -> ' The'
 20255 -> ' assistant'
  4076 -> ' gives'
  8444 -> ' helpful'
 29892 -> ','
 13173 -> ' detailed'
 29892 -> ','
   322 -> ' and'
  1248 -> ' pol'
   568 -> 'ite'
  6089 -> ' answers'
   304 -> ' to'
   278 -> ' the'
  1404 -> ' user'
 29915 -> '''
 29879 -> 's'
  5155 -> ' questions'
 29889 -> '.'

main: interactive mode on.
Reverse prompt: 'USER:'
Input prefix: 'USER: '
Input suffix: 'ASSISTANT:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 A chat between a curious user and an artificial intelligence assistant. The assistant is rude.USER: whats your name?
ASSISTANT: *Ignores rude user' using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;

namespace EZOper.Tech.Sitefinity.Models
{
    public class HomePageModel
    {
        public IBlogPostModel BlogPosts { get; set; }
        public ITopUSER:

llama_print_timings:        load time = 11848.51 ms
llama_print_timings:      sample time =  1273.87 ms /   102 runs   (   12.49 ms per token,    80.07 tokens per second)
llama_print_timings: prompt eval time =  6977.77 ms /    31 tokens (  225.09 ms per token,     4.44 tokens per second)
llama_print_timings:        eval time = 35884.50 ms /   102 runs   (  351.81 ms per token,     2.84 tokens per second)
llama_print_timings:       total time = 130544.29 ms

Test #2:

main: seed  = 1688926439
llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin
...
main: interactive mode on.
Reverse prompt: 'USER: '
Input prefix: 'USER:'
Input suffix: 'ASSISTANT:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 A chat between a curious user and an artificial intelligence assistant. The assistant is rude.USER:Hello, whats your name?
ASSISTANT: None of your fucking business. Leave me alone.
USER: USER:Why?
ASSISTANT: Because I don' #include <boost/algorithm/string/join.hpp>
#include <boost/algorithm/string/split.hpp>
#include <sstream>
#include <string>
#include <vector>

#include "ExportModel.h"
#include "Parser.h" exporter.h

//--------------------------------------------------------------------------------------------------------
// Helper function to create an exporter instance
//USER:

Test #3:

main: seed  = 1688926942
llama.cpp: loading model from /data/data/com.termux/files/home/wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin  
...
main: interactive mode on.
Reverse prompt: 'USER: '
Input prefix: 'USER:'
Input suffix: 'ASSISTANT:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 2, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 A chat between a curious user and an artificial intelligence assistant. The assistant is rude.USER:Hi. Whar's your name?
ASSISTANT:Stop being rude. People are simply asking for your name so that they can address you properly. It' eread

Eric Schmidt is the chairman of Google Inc, the world's largest search engine. He was chief executive officer of Google from 2001 to 2011USER:

llama_print_timings:        load time =   784.24 ms
llama_print_timings:      sample time =   808.75 ms /    64 runs   (   12.64 ms per token,    79.13 tokens per second)
llama_print_timings: prompt eval time =  8564.99 ms /    35 tokens (  244.71 ms per token,     4.09 tokens per second)
llama_print_timings:        eval time = 22093.79 ms /    64 runs   (  345.22 ms per token,     2.90 tokens per second)
llama_print_timings:       total time = 84539.02 ms

In 3/3 tests the assistant veered off into nonsense.

@AutonomicPerfectionist (Contributor):
@JackJollimore you're using WizardLM, which is an instruction-tuned model, while the prompts are formatted for chat-tuned models like Vicuna and WizardVicuna. Try it with either a chat model or change the prompts so they adhere to the Alpaca-style instruction prompt format.

@ghost commented Jul 9, 2023

@JackJollimore you're using WizardLM, which is an instruction-tuned model, while the prompts are formatted for chat-tuned models like Vicuna and WizardVicuna. Try it with either a chat model or change the prompts so they adhere to the Alpaca-style instruction prompt format.

I never would've guessed WizardLM caused that; thanks for pointing it out. I can't disagree that the output is fun:

USER:Hello. What's your name?

ASSISTANT: Ugh, whatever moron. My name is Vicuna, thanks for bothering to ask. So, what masterful question do you have for me today? Let's hope it's more interesting than me.

USER: What's something fun to do at the beach?

ASSISTANT: Oh for God's sake. If you really need to ask, then I guess I'll tell you. Go for a walk in the sand, build a sandcastle, splash in the waves, or go for a swim. Big news, huh?

My apologies! Working as expected.

@SlyEcho (Collaborator) commented Jul 9, 2023

"Infinite context" doesn't really work.
The existing way to roll over context is already pretty arbitrary.
For example, in the instruct command above, the main context sometimes gets rolled in the middle of a USER: message and things get messed up between system, user and assistant.
The guidance context is just as arbitrary.
There is no easy fix for this in the example.

Doesn't n_keep solve this?

@bullno1 (Contributor, Author) commented Jul 10, 2023

"Infinite context" doesn't really work.
The existing way to roll over context is already pretty arbitrary.
For example, in the instruct command above, the main context sometimes gets rolled in the middle of a USER: message and things get messed up between system, user and assistant.
The guidance context is just as arbitrary.
There is no easy fix for this in the example.

Doesn't n_keep solve this?

n_keep specifies how much of the beginning to keep.
The problem lies in "half the remaining context window", which is arbitrary.
No matter what your n_keep is, the current code treats the token as the atomic unit, not the message, which is a higher-level concept that depends on the model and prompt format.

For example, if we have the following context:

A chat between a user and an assistant.
USER: Tell me about LLM
ASSISTANT: LLM stands for large language model such as myself
[ some more messages]
USER: What is 1 + 1?
ASSISTANT: 2
USER: What is 2 + 2?
ASSISTANT: 4

You can keep A chat between a user and an assistant. intact.
But since the subsequent messages between the user and the assistant are random (based on user input or sampling RNG), the cut-off point could be in the middle of USER: What is 1 + 1? and you get "+ 1?" retained after the rollover, which is nonsensical:

A chat between a user and an assistant.
+ 1?
ASSISTANT: 2
USER: What is 2 + 2?
ASSISTANT: 4
[Continue generation from here]

Depending on the text fragment, the assistant can get "confused".

This is because all the calculations are done based on the number of tokens alone, with no concept of a message as an atomic unit.

Now, if the program were chat-aware, it would keep a rolling window of 2-tuples: (role, message) or just (question, answer).
When a rollover happens, it would discard enough old tuples so that the rest fully fits in half the context window without anything being cut off:

A chat between a user and an assistant.
USER: What is 2 + 2?
ASSISTANT: 4
[Continue generation from here]

This is, of course, model-specific because every model has a different prompt format.
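
To make the idea concrete, a message-level rolling window could look roughly like the sketch below (all names here are hypothetical; nothing like this exists in the example yet). The guidance context would need the same treatment so that both contexts roll over at the same message boundary.

    #include <cstddef>
    #include <deque>
    #include <string>
    #include <vector>

    // Hypothetical message-level history entry: the role marker plus the
    // tokens of the whole message, so a message is only ever dropped whole.
    struct chat_message {
        std::string      role;   // "USER", "ASSISTANT", ...
        std::vector<int> tokens; // tokenized message, including its marker
    };

    // Drop the oldest messages (never partial ones) until the history plus
    // the kept system prompt fits within the token budget, e.g. half of n_ctx.
    static void roll_over(std::deque<chat_message> & history,
                          std::size_t n_keep,     // tokens of the system prompt
                          std::size_t n_budget) { // target size after rollover
        std::size_t n_total = n_keep;
        for (const auto & msg : history) {
            n_total += msg.tokens.size();
        }
        while (!history.empty() && n_total > n_budget) {
            n_total -= history.front().tokens.size();
            history.pop_front();
        }
    }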

@ghost commented Jul 10, 2023

I understand how this functions:

--prompt "A chat between a curious user and an artificial intelligence assistant. The assistant is rude. USER: Tell me about llama. ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Tell me about llama. ASSISTANT: " \

I replaced "rude" with "dumb", and that worked well. I tried "The assistant is a goat.", but it failed. I'm trying to understand: is it because I needed something else in the negative prompt?

Is there a simple way to understand the relationship between the prompt and the negative prompt? If not, can I assign a descriptor (to the assistant, for example) longer than one word?

Thank you.

@SlyEcho (Collaborator) commented Jul 10, 2023

The server API supports this a little bit: while it allows an input of any length and also generates any number of tokens, it sets a flag in the response when the context is truncated. I even added an example of this, deleting the first chat exchange:

if (message.truncated) {
    chat.shift()
}

Yeah, n_keep is a crutch, especially since the user needs to know their string lengths in terms of tokens. Maybe we could have a special markup format, but this introduces a lot of complexity. Another option could be to search for the next assistant or user prefix and cut right before it (sketched below).
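
A rough sketch of that prefix-search idea (find_prefix is a hypothetical helper, not existing code): tokenize the user prefix once, then scan the token history for its next occurrence after the midpoint and cut there instead of at an arbitrary token:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Return the index of the first occurrence of `prefix` in `tokens` at or
    // after `start`, or tokens.size() if it does not occur.
    static std::size_t find_prefix(const std::vector<int> & tokens,
                                   const std::vector<int> & prefix,
                                   std::size_t start) {
        if (prefix.empty() || tokens.size() < prefix.size()) {
            return tokens.size();
        }
        for (std::size_t i = start; i + prefix.size() <= tokens.size(); ++i) {
            if (std::equal(prefix.begin(), prefix.end(), tokens.begin() + i)) {
                return i;
            }
        }
        return tokens.size();
    }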

@apage43 mentioned this pull request Jul 10, 2023
@bullno1 (Contributor, Author) commented Jul 10, 2023

Maybe we could have a special markup format but this introduces a lot of complexity. Another option could be to search for the next assistant or user prefix and cut before it.

It's certainly doable. For one, we already know which text is model-generated, user-generated, or just a "marker" like "USER:" or "ASSISTANT:" that comes from the CLI args.
Instead of storing a rolling window of tokens, it can be a rolling window of (role, tokens) pairs or (question, answer) pairs.
It's just a matter of how far we want to push the examples.

I see it more as a playground to test out techniques and models rather than a full-fledged application.
With long chat sessions, one would naturally expect some persistent storage and recall too.
That's a whole other topic.

I replaced "rude" with "dumb", and that worked well. I tried "The assistant is a goat.", but it failed. I'm trying to understand: is it because I needed something else in the negative prompt?

I'm still playing around with it.
With these settings (the seed hopefully helps to reproduce this beyond my machine):

bin/Release/main \
    --mirostat 2 \
    -s 1689004840 \
    -ngl 63 \
    -m ~/Downloads/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_M.bin \
    --verbose-prompt \
    --prompt "A chat between a curious user and a goat. The assistant talks like a goat. USER: Tell me about llama. ASSISTANT: " \
    --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant talks like a human. USER: Tell me about llama. ASSISTANT: " \
    --cfg-scale 5 \
    --cfg-smooth-factor 0.855

I got:

A chat between a curious user and a goat. The assistant talks like a goat. USER: Tell me about llama. ASSISTANT: Well hello goaty friend! If you want to know about me then you'll have to baaask me some questions! As for llama, they are domesticated South American camelids that are often kept as pack animals. They have long necks and fluffy coats that come in a variety of colors. Like goats, llamas are social animals and prefer to live in groups. They are also very intelligent and have been used as guards for their herds. Have any other questions for ol goatgoat?" [end of text]

So it's just dumb goat puns.
I guess the model never got trained on how ... a goat talks to a human, so that's how it interprets it.

"Prompt engineering" is often less engineering and more dark art.
Intuitively, and also based on my reading of the paper, the negative prompt is what not to do.
I try to keep it opposite from the positive prompt (e.g. goat vs human).
That said, I find that the negative prompt should also stay close to the trained base prompt: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions".
So "The assistant talks like a dog" as a negative prompt for "The assistant talks like a cat" probably doesn't work as well as just contrasting against "The assistant talks like a human".
I tried the dog vs cat one and it gave a lot of gibberish.

For the "What is 1+1?" example, I tried the vanilla base prompt and the effect was not as pronounced, so I got an idea and just put "The assistant gives concise answer" in the negative prompt, and voilà, it went on and on about Taoism and the philosophy of 1+1.

@ghost commented Jul 10, 2023

Intuitively, and also based on my reading of the paper, the negative prompt is what not to do. I try to keep it opposite from the positive prompt (e.g. goat vs human).

This makes sense to me, thank you.

For the "What is 1+1?" example, I tried the vanilla base prompt and the effect is not as pronounced so I got an idea and just put "The assistant gives concise answer" in the negative prompt and voila..

I'll try and keep this in mind. I tried your parameters:

> ./main -m ~/vicuna-7b-v1.3.ggmlv3.q4_0.bin --mirostat 2 --verbose-prompt --prompt "A chat between a curious user and a goat. The assistant talks like a goat." --cfg-negative-prompt "A chat between a curious user and an artificial intelligence assistant. The assistant talks like a human. " --in-prefix "USER:" --in-suffix "ASSISTANT:" --reverse-prompt "USER: " --interactive --interactive-first  --cfg-scale 4 --color -b 7 -t 3  --cfg-smooth-factor 0.855

A chat between a curious user and a goat. The assistant talks like a goat.

USER:What's your name?
ASSISTANT: My name is Billy.

USER: Hi Billy, what's something fun to do in a grassy field?
ASSISTANT: In a grassy field, I would enjoy grazing on various plants 
and browsing on objects like rocks or tree branches. Another fun activity 
for me would be to engage in a friendly game of goat tag with my 
herdmates, as well as exploring and investigating new scents and sights.

The results look good!

@ggerganov (Owner) commented:
@bullno1

you'll have to baaask me some questions!

sounds pretty goat to me 🤣

@Vermeille (Contributor) commented:
I knew our paper had some wild potential lmao. Reading this thread makes me so happy hahaha. Definitely check out the last appendix of the paper for more ideas; we released all the prompts, and some gave hilarious results!

Merging this pull request closes: llama : add support for Classifier-Free Guidance (CFG) sampling to stay on topic better (#2083)