Server: OpenAI-compatible POST /v1/chat/completions API endpoint #4160

Conversation

@kir-gadjello (Contributor) commented Nov 22, 2023

This PR makes it possible to use OpenAI ChatGPT API-compatible software right away after building just the server binary, without any wrappers like "LocalAI" or "api_like_OAI.py". I think merging this will remove friction around one of the most popular llama.cpp use cases and make wrappers like LocalAI less necessary, which might drive adoption up.

It works: I use this code with several projects, including https://github.com/talkingwallace/ChatGPT-Paper-Reader and third-party ChatGPT UIs. Both streaming and synchronous API modes are supported and tested by yours truly in real applications.

For now, only ChatML-based models are supported, but supporting other formats is just a matter of adding prompt formatters, and ChatML is becoming the de facto standard anyway.

No breaking change is made to the existing codebase: it's just a completely separate, optional API endpoint plus a few state variables, and only server.cpp was modified.

After merging this, we could work on replicating other OpenAI APIs, such as vision.
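For reference, here is a minimal C++ client sketch using cpp-httplib and nlohmann::json (the same libraries the server example already uses). The host and port, the placeholder model name, and the assumption that the response mirrors the OpenAI choices[0].message.content shape are illustrative, not guaranteed by this PR:

```cpp
// Minimal sketch of calling the new endpoint from C++ with cpp-httplib and nlohmann::json.
// Assumes the server example is listening on localhost:8080 (adjust to your setup).
#include <iostream>
#include "httplib.h"
#include "json.hpp"

using json = nlohmann::json;

int main() {
    httplib::Client cli("http://localhost:8080");

    // Request body in the OpenAI chat completions shape.
    const json body = {
        {"model", "local-model"},   // placeholder; the local server does not need a real model id
        {"messages", json::array({
            {{"role", "system"}, {"content", "You are a helpful assistant."}},
            {{"role", "user"},   {"content", "Hello!"}}
        })},
        {"stream", false}
    };

    auto res = cli.Post("/v1/chat/completions", body.dump(), "application/json");
    if (res && res->status == 200) {
        const json reply = json::parse(res->body);
        std::cout << reply["choices"][0]["message"]["content"].get<std::string>() << std::endl;
    } else {
        std::cerr << "request failed" << std::endl;
    }
    return 0;
}
```

Setting "stream": true switches to the streaming mode mentioned above, which a client has to consume incrementally instead of parsing a single JSON body.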

@ggerganov (Owner) left a comment

Nice addition!

  • Use 4 space indentation
  • Column width is 120 characters

@ggerganov requested a review from tobi on November 22, 2023 at 10:28
Comment on lines 2229 to 2246

```cpp
std::string format_chatml(std::vector<json> messages) {
    std::ostringstream chatml_msgs;

    // iterate over the messages array
    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role", std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
                    << "<|im_end|>\n";
    }

    chatml_msgs << "<|im_start|>assistant" << '\n';

    return chatml_msgs.str();
}
```

@ggerganov (Owner) commented:

I wonder if we should enable tokenization of special tokens by default in the server example:

```cpp
std::vector<llama_token> tokenize(const json & json_prompt, bool add_bos) const
{
    // If `add_bos` is true, we only add BOS, when json_prompt is a string,
    // or the first element of the json_prompt array is a string.
    std::vector<llama_token> prompt_tokens;

    if (json_prompt.is_array())
    {
        bool first = true;
        for (const auto& p : json_prompt)
        {
            if (p.is_string())
            {
                auto s = p.template get<std::string>();
                std::vector<llama_token> p;
                if (first)
                {
                    p = ::llama_tokenize(ctx, s, add_bos);
                    first = false;
                }
                else
                {
                    p = ::llama_tokenize(ctx, s, false);
                }
                prompt_tokens.insert(prompt_tokens.end(), p.begin(), p.end());
            }
            else
            {
                if (first)
                {
                    first = false;
                }
                prompt_tokens.push_back(p.template get<llama_token>());
            }
        }
    }
    else
    {
        auto s = json_prompt.template get<std::string>();
        prompt_tokens = ::llama_tokenize(ctx, s, add_bos);
    }

    return prompt_tokens;
}
```

Here, the calls to llama_tokenize currently default to special == false because there are edge cases where the user could be using HTML tags such as <s>, and these would be incorrectly tokenized as special tokens.

Technically, the correct solution is the one we have in main:

```cpp
const auto line_pfx = ::llama_tokenize(ctx, params.input_prefix, false, true);
const auto line_inp = ::llama_tokenize(ctx, buffer,              false, false);
const auto line_sfx = ::llama_tokenize(ctx, params.input_suffix, false, true);
LOG("input tokens: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, line_inp).c_str());
```

But this seems a bit tricky to implement in the server.

I think there is no point in sacrificing ChatML compatibility just for that extreme edge case, so maybe we should use special == true in the server.

@staviq and others, any thoughts?

@kir-gadjello (Contributor, Author) commented Nov 23, 2023

  • Can we enable it only in the context of /v1/chat/completions (and maybe other future OpenAI-compatible APIs)? Note that I added an oaicompat field to the slot, so it should be easy to do tokenize(..., special = slot.oaicompat) (a rough sketch follows below).
  • Can we update tokenize so it can be given a whitelist of allowed special tokens? The ChatML tokens are special because of the "|" character and won't clash with HTML tags, so allowing them and only them should do no harm.

Preferably, I would leave this decision & implementation to core contributors.
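For illustration, a minimal sketch of the first option. It assumes the oaicompat slot flag added in this PR and the four-argument ::llama_tokenize helper quoted earlier in this thread; the wrapper name below is hypothetical:

```cpp
// Sketch: parse special tokens only for OpenAI-compatible requests.
// `slot_oaicompat` stands in for the slot's `oaicompat` flag added in this PR.
#include <string>
#include <vector>

#include "common.h"
#include "llama.h"

static std::vector<llama_token> tokenize_for_slot(
        llama_context * ctx, const std::string & text, bool add_bos, bool slot_oaicompat) {
    // /v1/chat/completions requests get special == true so ChatML markers become
    // control tokens; plain /completion requests keep the current behaviour.
    return ::llama_tokenize(ctx, text, add_bos, /*special =*/ slot_oaicompat);
}
```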

@ggerganov (Owner) commented:

There are many options and we can do anything we like, though it's a bit difficult for me to make the decision. There are so many variations of these templates and tokenization is incredibly over-complicated in general.

> Can we enable it only in the context of /v1/chat/completions (and maybe other future OpenAI-compatible APIs)?

Yes, let's do that.

> Can we update tokenize so it can be given a whitelist of allowed special tokens?

I'm not sure this is needed. What issue would it fix?
The model already knows internally which tokens are special.

@kir-gadjello (Contributor, Author) commented:

I fixed the code style, though I'm not sure about some places. I hope clang-format will eventually solve this; right now it produces a global patch, so a core contributor will have to apply it instead of me.
If there are no other problems, you can merge it.

@shibe2 (Collaborator) commented Nov 23, 2023

A disadvantage of processing special tokens in this way is that the model then cannot distinguish between, say, the special token im_start and the literal text "<|im_start|>". Even with ChatML, special tokens should not be recognized inside the message body. A proper way to format prompts is to obtain the identifiers of the needed special tokens in a separate step and put only the text through the tokenizer.
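Below is a rough sketch of that approach, using only the ::llama_tokenize and json_value helpers already quoted in this PR. It assumes the snippet lives in server.cpp next to format_chatml, and that "<|im_start|>" and "<|im_end|>" each tokenize to exactly one control token when special parsing is enabled, which should be verified for the loaded vocabulary:

```cpp
// Sketch: resolve the ChatML control tokens once, then push only message text
// through the tokenizer with special == false, so a literal "<|im_start|>"
// inside the content can never turn into a control token.
// (needs <cassert> in addition to the server's existing includes)
static llama_token single_special_token(llama_context * ctx, const std::string & marker) {
    const auto toks = ::llama_tokenize(ctx, marker, /*add_bos =*/ false, /*special =*/ true);
    assert(toks.size() == 1 && "expected the marker to map to one special token");
    return toks[0];
}

static std::vector<llama_token> format_chatml_tokens(llama_context * ctx, const std::vector<json> & messages) {
    const llama_token im_start = single_special_token(ctx, "<|im_start|>");
    const llama_token im_end   = single_special_token(ctx, "<|im_end|>");

    std::vector<llama_token> out;
    auto append_text = [&](const std::string & text) {
        // plain text only: special == false
        const auto toks = ::llama_tokenize(ctx, text, /*add_bos =*/ false, /*special =*/ false);
        out.insert(out.end(), toks.begin(), toks.end());
    };

    for (const auto & msg : messages) {
        out.push_back(im_start);
        append_text(json_value(msg, "role", std::string("user")) + "\n");
        append_text(json_value(msg, "content", std::string("")));
        out.push_back(im_end);
        append_text("\n");
    }

    out.push_back(im_start);
    append_text("assistant\n");
    return out;
}
```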

A collaborator commented on this part of format_chatml:

```cpp
    for (auto it = messages.begin(); it != messages.end(); ++it) {
        chatml_msgs << "<|im_start|>"
                    << json_value(*it, "role", std::string("user")) << '\n';
        chatml_msgs << json_value(*it, "content", std::string(""))
```

I think that content should not be mixed with message prefixes and suffixes. If content happens to contain the strings "<|im_start|>" or "<|im_end|>", they should not be converted to special tokens. A better way to format to ChatML is to create an array as described in the documentation of the /completion endpoint.
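For illustration, this is roughly what such an array prompt could look like when built with the server's nlohmann::json alias; the im_start / im_end token ids are placeholders that would be resolved separately (e.g. as in the sketch earlier in this thread):

```cpp
// Sketch: a /completion-style "prompt" array mixing raw token ids and plain strings.
// In the tokenize() implementation quoted earlier, numeric elements are passed
// through as token ids, while strings are tokenized with special == false, so
// message content can never produce ChatML control tokens.
static json build_chatml_array_prompt(llama_token im_start, llama_token im_end,
                                      const std::string & role, const std::string & content) {
    return json::array({
        im_start,
        role + "\n",
        content,          // a literal "<|im_start|>" in here stays plain text
        im_end,
        "\n",
        im_start,
        "assistant\n"
    });
}
```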

@tobi (Sponsor, Collaborator) commented Nov 23, 2023

Thank you. Yes, this will make a huge difference.

One idea for the chat format: we could see whether we can bake the chat templates into the gguf file metadata. Hugging Face Transformers now stores the chat format as metadata in the tokenizer_config.json file, according to https://huggingface.co/docs/transformers/chat_templating.

Sadly it's some god-awful Jinja syntax, but since we have a lot of EBNF experts here, we may have a go at parsing a subset of it. It would definitely make things future-proof. There are a lot of goofy template styles floating around, although ChatML is definitely the right standard.

@Azeirah (Contributor) commented Nov 23, 2023

Hey, this is a really cool addition!

I tried using it with a client in ObsidianMD, but I'm getting CORS issues. I know api_like_OAI.py configures CORS with Flask.

httplib doesn't offer explicit CORS support, but it is possible to configure an OPTIONS handler: yhirose/cpp-httplib#62 (comment)
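For what it's worth, a sketch of what that could look like with cpp-httplib, assuming access to the server's httplib::Server instance (called svr here); the wildcard origin and the header list are illustrative choices, not settled behaviour:

```cpp
// Sketch: permissive CORS handling for the server example with cpp-httplib.
#include "httplib.h"

void enable_cors(httplib::Server & svr) {
    // Answer browser preflight requests (e.g. from the ObsidianMD client above).
    svr.Options(R"(.*)", [](const httplib::Request &, httplib::Response & res) {
        res.set_header("Access-Control-Allow-Origin",  "*");
        res.set_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
        res.set_header("Access-Control-Allow-Headers", "Content-Type, Authorization");
    });

    // Attach the origin header to regular responses as well.
    svr.set_default_headers({
        {"Access-Control-Allow-Origin", "*"}
    });
}
```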

@tobi (Sponsor, Collaborator) commented Nov 24, 2023

I threw a few different local software pieces and libraries at it, and it works really well.

Two things I think need doing:

  • For sanity's sake, let's mount the endpoints both with the v1/ prefix and without it, to reduce the potential for manual error.
  • Let's implement the v1/models endpoint; a bunch of tools hit it to check whether the server is working (see the sketch after this comment).

Thanks again!
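A rough sketch of such a v1/models handler, again with cpp-httplib and assuming the server's nlohmann::json alias; the model id, created timestamp, and owned_by values are placeholders, and the same handler could also be mounted at /models to cover the first point:

```cpp
// Sketch: minimal /v1/models handler returning the OpenAI-style model list.
void register_models_endpoint(httplib::Server & svr) {
    svr.Get("/v1/models", [](const httplib::Request &, httplib::Response & res) {
        const json models = {
            {"object", "list"},
            {"data", json::array({
                {
                    {"id",       "llama.cpp"},   // placeholder; could come from the loaded gguf
                    {"object",   "model"},
                    {"created",  0},
                    {"owned_by", "llama.cpp"}
                }
            })}
        };
        res.set_content(models.dump(), "application/json");
    });
}
```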

@kir-gadjello (Contributor, Author) commented:

@tobi @shibe2 Thank you for trying this out and proposing improvements. The thing is, I'm very busy with my main project right now, so I won't be able to return to this PR to implement a proper ChatML template engine and the /models endpoint within the next week. If we need to mainline this sooner, I'd be happy to accept commits into this branch 🙏

@ggerganov (Owner) commented:

I continued this into a local branch: #4198

I propose that we enable special tokens by default for now, because I think handling ChatML is more important than the few edge cases where the input text contains the literal string "<|im_start|>". We will fix this later as discussed above.

Will create a new issue to list the rest of the proposed improvements and look for contributors to help out with the implementation.

@cebtenzzre (Collaborator) commented:

Closing in favor of #4198
