
server : parallel decoding and multimodal #3589

Closed
Wants to merge 41 commits

Conversation

@FSSRepo (Collaborator) commented Oct 11, 2023

I'm working on this PR based on the server I've already created. There are commented-out sections to save myself unnecessary headaches, and I haven't even tested the results of this PR yet.

Here's the plan for a complete implementation:

  • Make completion work in streaming mode.
  • Fix non-streaming mode. In progress...
  • Fix the embedding endpoint. Inherent issue with the CUDA backend: Bug: Invalid Embeddings if GPU offloaded (CUDA) #3625
  • Test the grammars and some completion options.
  • Fix the infill function and test it; it would make my life easier if someone could provide a way to test that endpoint.
  • Run completion and chat tests, with and without the UI.
  • Fix the context reset mode. I need to understand the approach here, since the slots share the entire assigned context, so one slot could consume most of the context and leave the others with none.
  • Refactor the code by removing duplication and unnecessary comments, and prepare it for merging into the master branch.
  • Beam search is fixed but not tested; if any issues come up with it, I will take care of them.

@cebtenzzre marked this pull request as draft on October 11, 2023 at 22:23
@FSSRepo changed the title from "server: parallel decoding implementation" to "server : parallel decoding implementation" on Oct 11, 2023
@ggerganov (Owner):

You should rebase and adopt the latest sampling changes from: #3543

@FSSRepo (Collaborator, Author) commented Oct 12, 2023

@KerfuffleV2 With the new way of working with samplers, if each slot has different llama_sampling_params, then I have to create a llama_sampling_context every time I want to perform a completion.

It would be ideal if there were a function llama_set_sampling_params(llama_sampling_context ctx, llama_sampling_params params) to change the parameters without having to recreate the context every time.

llama_sampling_context llama_sampling_context_init(
        const struct gpt_params & params,
                  llama_grammar * grammar) {
  llama_sampling_context result;

  result.params = params.sampling_params;
  result.grammar = grammar;
  return result;
}

To reduce headaches, it should be:

llama_sampling_context llama_sampling_context_init(
        const struct llama_sampling_params & params,
                  llama_grammar * grammar) {
  llama_sampling_context result;

  result.params = params;
  result.grammar = grammar;
  return result;
}
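
For illustration, a minimal sketch of what such a setter could look like (hypothetical; it assumes the llama_sampling_context layout shown above and is not part of the existing API):

void llama_set_sampling_params(
          llama_sampling_context & ctx,
    const llama_sampling_params & params) {
    // Swap only the parameters; the grammar and any accumulated sampling
    // state stay untouched, so the context can be reused across
    // completions instead of being re-created for every request.
    ctx.params = params;
}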

mmmmmmm

@FSSRepo (Collaborator, Author) commented Oct 12, 2023

The completion function is working fine; however, for some reason, the server crashes when it finishes. I haven't been able to identify the cause.

@KerfuffleV2 (Collaborator):

mmmmmmm

Yes, I agree with that. Looks like there are also likely more changes coming: #3601

for some reason, the server crashes when it finishes. I haven't been able to identify the cause.

Maybe try compiling with the address sanitizer turned on? (Support was recently added to normal Makefile builds as well.)

@FSSRepo (Collaborator, Author) commented Oct 12, 2023

Maybe try compiling with the address sanitizer turned on? (Support was recently added to normal Makefile builds as well.)

It was a null pointer: I was accessing, from inside a lambda, an object defined outside of it, which always confuses me because the compiler never reports an error in that case.
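
For reference, a minimal, hypothetical example of this class of bug (not the actual server code): a lambda captures state whose pointer is cleared before the callback runs, and the compiler has no way to diagnose it.

#include <functional>
#include <iostream>

struct slot_state {
    int * data = nullptr;
};

int main() {
    slot_state slot;
    int value = 42;
    slot.data = &value;

    // The lambda captures `slot` by reference; it sees whatever state
    // `slot` has when the callback finally runs, not at capture time.
    std::function<void()> on_done = [&slot]() {
        std::cout << *slot.data << "\n"; // may dereference a null pointer
    };

    slot.data = nullptr; // some other code path clears the pointer...
    on_done();           // ...and this crashes at runtime, with no compile-time error
    return 0;
}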

@FSSRepo (Collaborator, Author) commented Oct 13, 2023

@ggerganov I need to know what to do when the context gets crowded, given that the server is being used by multiple clients who share the entire context window. I assume the context swap should only occur when a single slot is fully utilizing the context, or that there should be a context limit per slot.

When I use the embeddings, the server doesn't return the same values for some reason. I've already checked the embedding example pipeline to compare it with the server implementation, but I'm not getting good results.

@FSSRepo (Collaborator, Author) commented Oct 13, 2023

Am I wrong?

Content:

{
  "content": "Which is your dog?"
}

PR server:

{
"embedding": [
    0.0013885498046875,
    0.02568817138671875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    -0.02291107177734375,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
]
}

master server:

{
  "embedding": [
    0.0013885498046875,
    0.02568817138671875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
    0.0013885498046875,
    -0.02291107177734375,
    -0.02291107177734375,
    -0.02291107177734375,
    0.0013885498046875,
    0.0013885498046875,
]
}

Embedding example:

2.071708 0.094213 -0.879323 -0.868794 -1.877190 0.149166 0.456107 -1.180170 -1.899972 0.324895 0.137798 1.375898 -0.929520 -0.724631 -3.895488 0.835485 2.534587 -1.334827 -1.068944 -1.522295 -2.574473 -1.089735 0.195261 0.207192 0.607332 -0.160022 -2.255088 0.348395 -0.843356 0.270982 -1.236908 -0.836289 -1.600230 -0.213995 -0.724451 1.241112 0.571537 -1.612993 -0.924750 -1.141107 0.301533 0.476756 0.925662 -0.778871 0.644059 -2.298353 -1.250245 0.888944 1.218905 1.417424 -1.571156 0.380106 1.282110 1.155357 0.254458 -0.391138 4.462999 -2.220449 -0.675895 2.643535 -0.059323 -0.518516 -1.783109 1.390705 1.135090 2.257272 ....

Edit:

This is a problem when I offload the model to CUDA; related issue: #3625. Fixed.

@FSSRepo (Collaborator, Author) commented Oct 13, 2023

@ggerganov @monatis The server example now supports the LLaVA model.

[image attachment: test]

Response:

[image attachment: Screenshot 2023-10-13 184159]

@FSSRepo (Collaborator, Author) commented Oct 13, 2023

[image attachment]

Params

{
  "stream": false,
  "n_predict": -1,
  "temperature": 0.1,
  "prompt": "A chat between a curious human and an artificial intelligence assistant.  The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER: [(img)]describe the image in detail.\nASSISTANT:",
  "image_data": "image data in base64",
  "stop": ["</s>"]
}

Response:

 The image features a desk with two computer monitors on it. One of the monitors has an unusual decoration: a small toy lamb figurine is sitting on top of it, adding a playful touch to the workspace. A keyboard and mouse are also present on the desk, indicating that this is likely a computer setup.\n\nIn addition to the main desk area, there's another smaller desk visible in the background. The room appears to be well-equipped for work or study purposes, with various items such as books and a TV placed around the space.

@monatis (Collaborator) commented Oct 14, 2023

{
  "stream": false,
  "n_predict": -1,
  "temperature": 0.1,
  "prompt": "A chat between a curious human and an artificial intelligence assistant.  The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER: [(img)]describe the image in detail.\nASSISTANT:",
  "image_data": "image data in base64",
  "stop": ["</s>"]
}

Awesome work! Thanks @FSSRepo for taking this on.

@FSSRepo (Collaborator, Author) commented Oct 16, 2023

@Seikho Can you test now?

@cebtenzzre changed the title from "server : parallel decoding implementation" to "server : parallel decoding and multimodal" on Oct 16, 2023
@Seikho commented Oct 17, 2023

The stopping string issue is all good in commit d7eca25; however, I can't start the server on commit 4d18043 (your most recent commit).

@FSSRepo (Collaborator, Author) commented Oct 17, 2023

The stopping string issue is all good in commit d7eca25; however, I can't start the server on commit 4d18043 (your most recent commit).

Is there a visible error, or does it just stop silently? Please send the output and the parameters you are giving to the server.

@Seikho commented Oct 17, 2023

At the time it was a silent error, although I didn't run with -v.
Now that I've switched back to that commit, recompiled, and tried again, it's running fine. Sorry for the wild goose chase.

@ggerganov (Owner):

I'll do a review tomorrow. First I will merge #3624 and then push changes to this PR to adapt it to the new sampling API.

I need to know what to do when the context gets crowded, given that the server is being used by multiple clients who share the entire context window. I assume the context swap should only occur when a single slot is fully utilizing the context, or that there should be a context limit per slot.

There should be a max context per client (n_ctx_per_slot), and the swap should be performed per client.
The total context should be n_system + n_ctx_per_slot*n_slot, where n_system is the number of tokens in the common system prompt.
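
For illustration, a minimal sketch of that budget (hypothetical helper; n_system, n_ctx_per_slot and n_slot are the quantities described above):

// Total KV-cache size the server would need:
// system prompt tokens + per-slot window * number of slots.
int total_ctx_required(int n_system, int n_ctx_per_slot, int n_slot) {
    return n_system + n_ctx_per_slot * n_slot;
}

// Example: a 128-token system prompt and 4 slots of 512 tokens each
// -> 128 + 512*4 = 2176 tokens of total context.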

When I use the embeddings, the server doesn't return the same values for some reason. I've already checked the embedding example pipeline to compare it with the server implementation, but I'm not getting good results.

Might be fixed via #3657

Do I leave multimodal support in the server enabled by default, or do I make it an option in CMake?

Enable by default

@FSSRepo (Collaborator, Author) commented Oct 17, 2023

There should be a max context per client (n_ctx_per_slot), and the swap should be performed per client. The total context should be n_system + n_ctx_per_slot*n_slot, where n_system is the number of tokens in the common system prompt.

I'm somewhat confused about this. Is n_ctx_per_slot a parameter for each slot or a general one? The per-slot context should not exceed the model's n_ctx: a user might assign 512 to one slot and 768 to another, but with an established context of only 1024, the two slots together request 256 more tokens than can be allocated.

Also, I have no idea how the seed parameter would fit into parallel decoding.

@FSSRepo (Collaborator, Author) commented Oct 18, 2023

@ggerganov Are you reviewing this now? I will try to adapt to the new changes in the sampling API.

@ggerganov (Owner):

I didn't find the time today, but tomorrow this will be the first thing I do. Until then, you can make changes if you want to. I'll mark the PR when I start reviewing.

@ggerganov self-requested a review on October 19, 2023 at 09:09
@ggerganov marked this pull request as ready for review on October 19, 2023 at 09:09
@ggerganov (Owner):

@FSSRepo I can't push into your branch:

$ ▶ git push FSSRepo HEAD:master
error: Authentication error: Authentication required: You must have push access to verify locks
error: failed to push some refs to 'https://github.com/FSSRepo/llama.cpp'

Let's continue in #3677. I've sent you a collaborator invite.
