
Can someone help me understand the KV cache? #9677

Answered by ggerganov
walker-ai asked this question in Q&A

  1. n_kv does indeed grow gradually, but in chunks of 32 or 256 (determined by llama_kv_cache_get_padding()):

llama.cpp/src/llama.cpp, lines 17186–17187 at commit 589b48d:

```cpp
const uint32_t pad = llama_kv_cache_get_padding(cparams);
kv_self.n = std::min(kv_self.size, std::max(pad, GGML_PAD(llama_kv_cache_cell_max(kv_self), pad)));
```

Padded values are masked during the attention calculation (the rounding is sketched right after this list, and the masking further below).

  2. The offset is always 0, since we view the KV cache buffers from their beginning up to n_kv.
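
To make the chunked growth concrete, here is a minimal, self-contained C++ sketch. The pad_up helper mirrors ggml's GGML_PAD round-up macro; kv_size, pad, and the cell_max values are hypothetical stand-ins for the real cache capacity, llama_kv_cache_get_padding(), and llama_kv_cache_cell_max():

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Mirrors ggml's GGML_PAD(x, n): round x up to the next multiple of n
// (n must be a power of two, which 32 and 256 are).
static uint32_t pad_up(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

int main() {
    const uint32_t kv_size = 4096; // total cache capacity (hypothetical)
    const uint32_t pad     = 256;  // what llama_kv_cache_get_padding() might return

    // cell_max stands in for llama_kv_cache_cell_max(): one past the last used cell.
    for (uint32_t cell_max : {1u, 31u, 32u, 257u, 300u, 4000u}) {
        const uint32_t n_kv = std::min(kv_size, std::max(pad, pad_up(cell_max, pad)));
        printf("cell_max = %4u -> n_kv = %4u\n", cell_max, n_kv);
    }
    // Prints n_kv jumping in steps of 256: 256, 256, 256, 512, 512, 4096.
    return 0;
}
```

Note that the view into the K/V buffers always starts at element 0 and only its end moves as n_kv grows, which is why no offset bookkeeping is needed (point 2).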

Technically, n_kv could be constant and equal to the maximum KV cache size. But this would make inference sub-optimal, because we would be attending to too many unused KV cells, which would increase the…
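
And a toy illustration of the masking mentioned in point 1, using made-up names rather than llama.cpp's internal API: scores for cells past the used range are set to -inf (like the KQ mask), so softmax assigns them zero weight, but the dot products for those cells are still evaluated, which is exactly the work that keeping n_kv close to the number of occupied cells avoids:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy single-query attention over a KV view of n_kv cells, of which only
// the first n_used hold real tokens. Illustrative only.
int main() {
    const int n_kv   = 8; // padded view size (a multiple of the pad chunk)
    const int n_used = 5; // actually occupied cells

    // Pretend raw attention scores q·k_i for every cell in the view.
    std::vector<float> scores = {0.9f, 0.2f, 0.5f, 0.1f, 0.7f, 0.3f, 0.8f, 0.4f};

    // Mask out the padded/unused tail so softmax gives it zero weight.
    for (int i = n_used; i < n_kv; ++i) {
        scores[i] = -INFINITY;
    }

    // Numerically stable softmax over the masked scores.
    float max_s = -INFINITY;
    for (int i = 0; i < n_kv; ++i) max_s = std::max(max_s, scores[i]);
    float sum = 0.0f;
    std::vector<float> w(n_kv);
    for (int i = 0; i < n_kv; ++i) {
        w[i] = std::exp(scores[i] - max_s);
        sum += w[i];
    }
    for (int i = 0; i < n_kv; ++i) {
        printf("cell %d: weight = %.4f\n", i, w[i] / sum);
    }
    // Cells 5..7 print 0.0000: they cannot affect the output, yet their
    // scores were still computed. With n_kv pinned to the full cache size,
    // that wasted work would scale with the cache, not with its contents.
    return 0;
}
```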

Answer selected by walker-ai