Proposal: An alternative to chat templates #6726

Closed
kaizau opened this issue Apr 17, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

@kaizau
Contributor

kaizau commented Apr 17, 2024

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Jinja template support has already been discussed extensively, and I'd place the main tension between:

  1. Keeping llama.cpp simple and maintainable
  2. Flexibly supporting a variety of current and future templates

I'm opening this issue to propose an alternative that potentially satisfies both. As a placeholder, let's call it role templates instead of chat templates:

std::unordered_map<std::string, std::string> chatML = {
  {"system",    "<|im_start|>system\n{{content}}<|im_end|>\n"},
  {"user",      "<|im_start|>user\n{{content}}<|im_end|>\n"},
  {"assistant", "<|im_start|>assistant\n{{content}}<|im_end|>\n"},
};

std::unordered_map<std::string, std::string> gemma = {
  // Models with no system message could just prepend to the first user message
  {"user",        "<start_of_turn>user\n{{content}}<end_of_turn>\n"},
  {"assistant",   "<start_of_turn>model\n{{content}}<end_of_turn>\n"},
};

std::unordered_map<std::string, std::string> mistral = {
  // Could have special "roles" for common exceptions to the pattern
  {"__begin",     "<s>"},
  {"user",        "[INST] {{content}} [/INST]"},
  {"assistant",   "{{content}}</s>"},
};

std::unordered_map<std::string, std::string> researchExperiment = {
  // Flexible enough to support whatever crazy template comes out next week
  {"user",        "<|user|>{{content}}<|end_user|>\n"},
  {"system_1",    "<|fast|>{{content}}<|end_fast|>\n"},
  {"system_2",    "<|slow|>{{content}}<|end_slow|>\n"},
  {"agent",       "<|agent|>{{content}}<|end_agent|>\n"},
  {"retriever",   "<|rag|>{{content}}<|end_rag|>\n"},
};

Applying a template is then just a matter of looping through the messages, looking up each message's role, and find-replacing {{content}}. And add_generation_prompt is just the substring in front of the next message's {{content}}.
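As a rough illustration of that apply step (a sketch only, not an existing llama.cpp function; the Message struct and apply_role_templates name are purely placeholders):

// Sketch: loop over messages, look up the role's template, substitute {{content}}.
#include <string>
#include <unordered_map>
#include <vector>

struct Message {
    std::string role;
    std::string content;
};

std::string apply_role_templates(
        const std::unordered_map<std::string, std::string> & tmpl,
        const std::vector<Message> & messages,
        bool add_generation_prompt) {
    const std::string placeholder = "{{content}}";
    std::string prompt;

    auto begin = tmpl.find("__begin");
    if (begin != tmpl.end()) {
        prompt += begin->second; // e.g. "<s>" in the mistral example above
    }

    for (const auto & msg : messages) {
        auto role = tmpl.find(msg.role);
        if (role == tmpl.end()) {
            continue; // or fold into the first user message, as noted for gemma
        }
        std::string piece = role->second;
        size_t pos = piece.find(placeholder);
        if (pos != std::string::npos) {
            piece.replace(pos, placeholder.size(), msg.content);
        }
        prompt += piece;
    }

    if (add_generation_prompt) {
        // append only the substring in front of the assistant's {{content}}
        auto assistant = tmpl.find("assistant");
        if (assistant != tmpl.end()) {
            prompt += assistant->second.substr(0, assistant->second.find(placeholder));
        }
    }
    return prompt;
}

With the chatML map above, a single user message plus add_generation_prompt would render as "<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n".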

This format itself could be anything — JSON, YAML, key-value pairs — making it easy to adopt in non-llama.cpp contexts as well.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

For llama.cpp maintainers / model authors:

  • It flattens the complexity of Jinja into a simple find-replace operation. But it's still flexible enough to handle most (all?) templates.
  • Similar to Jinja, it gives model authors control and responsibility over formatting, instead of needing others to translate their work into this and other projects.
  • Even if model authors are slow to adopt the format, it could be added to GGUF conversion as a suggested part of the process.

For end users:

  • New models should "just work" with a much greater frequency.
  • This could be exposed as a config option to allow providing custom role templates.

For client apps / front ends:

It's a viable alternative to the current state, where every chat client that uses llama.cpp's completion API maintains its own library of chat templates. Because llama.cpp doesn't support all templates, every downstream chat client still needs to reinvent the wheel.

For open models, in general:

Personally, my experience adding chat templates opened my eyes to just how messy the template landscape is right now. Open models don't just lag in scale, but also have to deal with compatibility and usability issues that the closed models can sidestep.

Chat templates feel like an important thing to get right, and I think llama.cpp can greatly simplify this for the many projects that depend on it.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

  • I'd lean towards starting with a python script that loads metadata from a diverse set of models, renders their Jinja templates, and generates a set of tests to validate whether this approach can handle all cases. Basically, an addition / expansion to tests/test-chat-template.
  • llama_chat_apply_template_internal could be refactored to use role templates under-the-hood so that the existing --chat-template flag still works.
  • Potentially has implications for #6391 (Implement (properly) different chat templates in main.cpp)

Happy to submit a PR or collaborate if this is a direction folks are interested in.

kaizau added the enhancement label on Apr 17, 2024
@phymbert
Collaborator

@ngxson, what do you think about the proposal, please?

@bullno1
Contributor

bullno1 commented Apr 18, 2024

Sounds cool, and I'd say take it further: why even template or search-and-replace within a role? Just change it to a "prefix" and a "suffix":

// Blind code, probably wrong, but that's the idea.
// Each role just has a prefix & suffix.
std::unordered_map<std::string, std::pair<std::string, std::string>> chatML = {
  {"system",    {"<|im_start|>system\n", "<|im_end|>\n"}},
  {"user",      {"<|im_start|>user\n", "<|im_end|>\n"}},
  {"assistant", {"<|im_start|>assistant\n", "<|im_end|>\n"}},
};

You can even pre-tokenize the prefix/suffix too.
Now the user content can be tokenized with parse_special = false.
No more injection risk, which is very overlooked right now. I don't think the Hugging Face library even handles this.
The user tokens will just be sandwiched between the special tokens.
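A minimal sketch of that sandwiching (llama_token is llama.cpp's token type; the tokenize() helper is a stand-in for whatever tokenizer call is actually used, not a real API):

// Sketch: keep the role markers pre-tokenized, and tokenize only the untrusted
// user text with parse_special = false so special-token text typed by the user
// stays plain text instead of becoming control tokens.
std::vector<llama_token> format_turn(
        const std::vector<llama_token> & prefix_tokens, // pre-tokenized role prefix
        const std::vector<llama_token> & suffix_tokens, // pre-tokenized role suffix
        const std::string & user_content) {
    std::vector<llama_token> out = prefix_tokens;
    std::vector<llama_token> content = tokenize(user_content, /*parse_special=*/false);
    out.insert(out.end(), content.begin(), content.end());
    out.insert(out.end(), suffix_tokens.begin(), suffix_tokens.end());
    return out;
}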

And to round it off, add a config for "stop token(s)" too, because Llama 3 uses eot_id, which throws off all the default configs.

Something like:

struct ChatTemplate {
    std::string start_of_conversation;  // Because bos is a thing
    std::unordered_map<std::string, std::pair<std::string, std::string>> roles;
    std::vector<std::string> stop_tokens;
};

// std::pair<std::string, std::string> should be more like `RoleConfig` with:

struct RoleConfig {
    std::string prefix;
    std::string suffix;
    // Maybe more config in the future like:
    bool is_machine_generated;
};

Llama-3 expressed in yaml would be:

start_of_conversation: "<|begin_of_text|>"
roles:
  system:
    prefix: |
      <|start_header_id|>system<|end_header_id|>

    suffix: "<|eot_id|>"
  user:
    prefix: |
      <|start_header_id|>user<|end_header_id|>

    suffix: "<|eot_id|>"
  assistant:
    prefix: |
      <|start_header_id|>assistant<|end_header_id|>

    suffix: "<|eot_id|>"
stop_tokens:
  - "<|eot_id|>"

The double line break is intentional.
It doesn't have to be yaml at all though. Just a series of gguf metadata fields would work too. Like: chat_template.roles.system.prefix.
This could allow a bunch of frontends to just work out of the box given a "properly configured" model file.
And if not, we can provide a simple tool to "patch in" the metadata from an external json/yaml file if needed.
A single file is great for distribution I'd say.
This solves this concern:

Even if model authors are slow to adopt the format, it could be added to GGUF conversion as a suggested part of the process.

That said, once it's that embedded in the format, should the prefix/suffix just be pretokenized instead?
Instead of having a "preparation" step to tokenize the "always static" sequences, just have the model provide the tokens ready to be used for each role.
This further reduces the chance for misuse. System tokens should be treated differently from user input. If the default server/frontend implementation always tokenizes with parse_special=false, there is much less risk of injection.

GGUF even has nested arrays for metadata.
The chat_template.stop_tokens field can be one such nested array. One inner array for each terminating sequence of tokens.
Most will just be [[eos]] though.

Although I do recall some models that end with \nUSER: to pass the turn to the user.
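To make the flat-metadata idea concrete, ChatML could be described with keys along these lines (the key names are only illustrative, following the chat_template.roles.*.prefix pattern above; this is not an existing GGUF convention):

chat_template.roles.system.prefix    = "<|im_start|>system\n"
chat_template.roles.system.suffix    = "<|im_end|>\n"
chat_template.roles.user.prefix      = "<|im_start|>user\n"
chat_template.roles.user.suffix      = "<|im_end|>\n"
chat_template.roles.assistant.prefix = "<|im_start|>assistant\n"
chat_template.roles.assistant.suffix = "<|im_end|>\n"
chat_template.stop_tokens            = [["<|im_end|>"]]

The stop_tokens value follows the nested-array idea: one inner array per terminating sequence, whether those are strings or pre-tokenized token ids.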

Explicitly listing the roles is even better than what Hugging Face does. Again, I think it's a great idea.
Off topic: look at the funsies from Gemma: https://huggingface.co/google/gemma-7b-it/blob/main/tokenizer_config.json#L1507
They throw an exception in the Jinja template just to ensure that roles are restricted.

Edit: You can even create an "auto convert" script that "works most of the time" with arbitrary templates.
Just render the template with a random sentinel string like "I_LIKE_BIG_LLAMA_AND_I_CANNOT_LIE" to see what comes out, then extract the prefix & suffix surrounding that string.
I wouldn't recommend it though.

@ngxson
Collaborator

ngxson commented Apr 19, 2024

The proposal here is pretty much the same as #5922, so I suggest moving the discussion there.

The main problem is that even with this level of flexibility, some templates can't be supported without some code logic (for example, the Llama 2 [INST] template with its <<SYS>> system message).

@kaizau
Contributor Author

kaizau commented Apr 20, 2024

@ngxson Ah, I missed that one. Will combine into #5922.

@bullno1 Good call on pre-tokenization, and good convergence of ideas here.

@hanishkvc
Contributor

hanishkvc commented Apr 28, 2024

Please do have a look at the PR below. Around the time Llama 3 came out, I had a need to look at llama.cpp, and in turn I worked on the following: trying to see if one can have a generic flow, driven by a config file, that accommodates different models / chat-handshake-template standards in a flexible way. The idea is that if a new template standard is added during fine-tuning of a model, or if a new model or standard comes out, but it follows a sane convention matching the commonality I have noticed across many models/standards, then the generic code flow itself can be used by just updating the config file, without having to add a custom template block.

This in turn can be used by both example/main and example/server or ... Currently, main has been patched to use this config-file-based flow, piggybacking to a great extent on its existing interactive mode and its in-prefix, in-suffix, and antiprompt.

#6834

Based on some minimal testing at my end, I seem to be able to handle the nitty-gritties of around 8 (+1) models using this generic code + config-file-based flow.

Currently JSON is used for the config file, but if needed it can be switched to a simpler text-based format, so that users of the llama.cpp library don't need to depend on a JSON library.

The generic flow uses a concept similar to what this proposal is also thinking of, but driven by a config file rather than hardcoded in the code, so that new models or variations can be added without having to recompile in many cases.

The generic flow additionally takes care of:

  • the conditionality seen across a few different models with respect to tagging of the system-message + first-user-message flow.

  • the need to differentiate between begin, role-prefix, role-suffix, and end tokens for each role, and in turn the variation in whether or not they are inserted across different models. This is handled in a simple and generic way, by allowing each of these to be set or left empty per role, as needed by that specific model.

UPDATE: I noticed that this is closed and refers to #5922, so I have added an equivalent note there.
