
kubernetes example #6546

Open
phymbert opened this issue Apr 8, 2024 · 16 comments

Labels
enhancement (New feature or request) · help wanted (Extra attention is needed) · kubernetes (Helm & Kubernetes) · server/webui

@phymbert
Collaborator

phymbert commented Apr 8, 2024

Motivation

Kubernetes is widely used in the industry to deploy products and applications at scale.

It would be useful for the community to have a llama.cpp Helm chart for the server.

I started several weeks ago and will continue when I have more time; meanwhile, any help is welcome:

https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes

@phymbert added the enhancement, server/webui, kubernetes (Helm & Kubernetes), and help wanted labels on Apr 8, 2024
@OmegAshEnr01n

Hi! I will take this up!

@phymbert
Collaborator Author

phymbert commented Apr 10, 2024

Great @OmegAshEnr01n, a few notes (a rough values.yaml sketch follows the list):

  • I think we need two subcharts: one for embeddings, one for generation/completions
  • we probably need to update the schema in my branch, since the model will now be downloaded by the server directly, and the related Job should be removed
  • we need to support both HF URL parameters and a raw URL for internal model repositories like Artifactory
  • metrics scraping must work for Prometheus community (with the PodMonitoring resource), enterprise, and ideally Dynatrace
  • the PVC must remain after the Helm release is uninstalled
  • autoscaling can be done later on, but it is a must-have
  • ideally the chart should be built by the CI and installable from gh-pages
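
For illustration, a rough top-level values.yaml covering these points might look like the sketch below. Every key and subchart name here (generation, embeddings, hfRepo, url, and so on) is a hypothetical placeholder, not the schema of the branch linked above; the HF/raw-URL split maps to the server's --hf-repo/--hf-file and --model-url download options, and helm.sh/resource-policy: keep is the standard Helm annotation for retaining a PVC across uninstall.

# Hypothetical values.yaml sketch; every key below is an illustrative
# placeholder, not the schema of the linked branch.
generation:                   # subchart for generation/completions
  model:
    hfRepo: ""                # HF parameters (server --hf-repo / --hf-file)...
    hfFile: ""
    url: ""                   # ...or a raw URL (server --model-url), e.g. an
                              # internal Artifactory repository
  persistence:
    size: 20Gi
    annotations:
      helm.sh/resource-policy: keep   # PVC survives helm uninstall

embeddings:                   # subchart for the embedding model
  model:
    hfRepo: ""
    hfFile: ""
    url: ""

metrics:
  enabled: true               # start the server with --metrics

autoscaling:
  enabled: false              # must-have, but can land later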

Ping here if you have questions. Good luck! Excited to use it.

@phymbert
Collaborator Author

Hi @OmegAshEnr01n, are you still working on this issue?

@OmegAshEnr01n

Yes, still am. I will share a pull request over the weekend when it is completed.

@OmegAshEnr01n

OmegAshEnr01n commented Apr 25, 2024

Hi @phymbert

What is the architectural reason for having the embedding model live in a separate deployment from the generative model? Requiring that would mean we would need to make changes to the HTTP server. Instead, we can have an architecture where the model and embeddings are tightly coupled. Something like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      {{- range $i, $container := .Values.containers }}
      - name: my-container-{{ $i }}
        image: {{ $container.image }}
        volumeMounts:
        - name: data-volume-{{ $i }}
          mountPath: /data
      {{- end }}
      volumes:
      {{- range $i, $container := .Values.containers }}
      - name: data-volume-{{ $i }}
        persistentVolumeClaim:
          claimName: pvc-{{ $i }}
      {{- end }}
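
For context, the values this template would consume might look like the following; the entries are illustrative (the project publishes a ghcr.io/ggerganov/llama.cpp:server image, but you would pin a concrete tag in practice):

containers:
  # Hypothetical entries consumed by the template above: one generative
  # server and one embeddings server, each with its own PVC (pvc-0, pvc-1).
  - image: ghcr.io/ggerganov/llama.cpp:server
  - image: ghcr.io/ggerganov/llama.cpp:server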

On another note, what is the intended use of Prometheus? Do you need it to live alongside the Helm chart or within it as a subchart? I don't see the value in adding Prometheus as a subchart. Perhaps you can share your view on that as well.

@phymbert
Collaborator Author

Embedding models are different from generative ones. In a RAG setup you need two models.

Prometheus is not required, but if it is present, metrics are exported.
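
For the Prometheus community stack, a minimal prometheus-operator PodMonitor along these lines could scrape the endpoint; the names and labels are placeholders, and it assumes the server is started with --metrics so that /metrics is exposed.

# Minimal PodMonitor sketch (prometheus-operator); names/labels are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llama-cpp-server
spec:
  selector:
    matchLabels:
      app: llama-cpp-server    # must match the server pods' labels
  podMetricsEndpoints:
  - port: http                 # named container port of the llama.cpp server
    path: /metrics             # exposed when the server runs with --metrics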

@OmegAshEnr01n

OK, just to clarify: server.cpp has a route for requesting embeddings, but the existing server code doesn't include the option to send embeddings for completions. That would need to be written before the Helm chart can be completed. Kindly correct me if I'm wrong.

@phymbert
Collaborator Author

Embeddings are meant to be stored in a vector DB for search. There is nothing related to completions except RAG later on.
Nothing needs to change in the server code.

@ceddybi

ceddybi commented May 3, 2024

@OmegAshEnr01n Sir, is the chart ready for production? 🚀🚀🚀🚀

@OmegAshEnr01n

Not yet. Currently testing it on a personal kube cluster with separate node selectors.

@Perdjesk

Perdjesk commented Jun 20, 2024

@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load-balancing is not suitable for llama.cpp:

Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.

From your experience with your k8s example, is the k8s Service load-balancing enough, or would you find it necessary to use a "slot-aware" load balancer?

/cc @mcharytoniuk

@mcharytoniuk
Contributor

@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load-balancing is not suitable for llama.cpp:

Typical strategies like round robin or least connections are not effective for llama.cpp servers, which need slots for continuous batching and concurrent requests. ... Paddler overcomes this by maintaining a stateful load balancer that is aware of each server's available slots, ensuring efficient request distribution.

Thanks for the mention. I maintain that point. Of course round robin will work, and "least connections" will be better (though it does not necessarily reflect how many slots are being used), but the issue is that prompts can take a long, varying time to finish. With round robin it is very possible to distribute the load unevenly (for example, if one of the servers was unlucky and is still processing a few huge prompts). To me the ideal is balancing based on slots, with some request queue on top of that (which I plan to add to Paddler, by the way :)). I love the slots idea because it makes the infra really predictable.

@phymbert
Collaborator Author

phymbert commented Jul 7, 2024

@phymbert From your experience with your k8s example, is the k8s Service load-balancing enough, or would you find it necessary to use a "slot-aware" load balancer?

Firstly, it's better to use the native llama.cpp KV cache: if you have k8s nodes with 2-4 A/H100 GPUs, having one pod per node that uses all the VRAM, with as many slots and as much cache as possible for the server, will give you maximum performance, but not HA.
Then, regarding load balancing, I tested IP affinity, round robin, and least connections; no significant differences were found. I think it depends on the dataset/use case or the client distribution.

Maybe an interesting approach would be to prioritize upfront based on input token size. Nonetheless, you cannot predict output token size.

I mainly faced issues with long-lived HTTP connections; IMHO we need a better architecture for this than SSE.
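
One way to approximate slot awareness with a plain k8s Service is a readiness probe against the server's health endpoint, so that a pod with no free slot is temporarily removed from the endpoints list instead of queueing new requests. The sketch below assumes the server build supports the fail_on_no_slot query parameter on /health (documented in the server README around this time) and that the API listens on port 8080; treat it as illustrative, not as the chart's actual probe configuration.

# Illustrative probes: unready when all slots are busy, so the Service
# stops routing new requests there; liveness ignores slot availability.
readinessProbe:
  httpGet:
    path: /health?fail_on_no_slot=1   # assumed: returns 503 when no slot is free
    port: 8080
  periodSeconds: 2
  failureThreshold: 1
livenessProbe:
  httpGet:
    path: /health                     # plain health check only
    port: 8080
  initialDelaySeconds: 120            # allow time for model load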

@OmegAshEnr01n

@phymbert I've made a pull request.

@phymbert
Collaborator Author

phymbert commented Jul 22, 2024

I've made a pull request.

The PR is on my fork:

phymbert#7

We need to bring it over here somehow.

@anencore94

Hope to see this soon.
