-
In the ...
```c
// split cached k into n_head heads
struct ggml_tensor * k =
    ggml_view_3d(ctx, kv.k_l[il],
            n_embd_head_k, n_kv, n_head_kv,
            ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
            ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
            0);
cb(k, "k", il);

// split cached v into n_head heads (note: v is stored transposed in the cache)
struct ggml_tensor * v =
    ggml_view_3d(ctx, kv.v_l[il],
            n_kv, n_embd_head_v, n_head_kv,
            ggml_element_size(kv.v_l[il])*n_ctx,
            ggml_element_size(kv.v_l[il])*n_ctx*n_embd_head_v,
            0);
cb(v, "v", il);

struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);
cb(kqv, "kqv", il);

struct ggml_tensor * kqv_merged = ggml_permute(ctx, kqv, 0, 2, 1, 3);
cb(kqv_merged, "kqv_merged", il);
```
... There are two parameters that I am concerned about: ...

Can anyone help with the above two questions? Thanks a lot.
-
`n_kv` indeed grows gradually, but in chunks of 32 or 256 (determined by `llama_kv_cache_get_padding()`):

llama.cpp/src/llama.cpp
Lines 17186 to 17187 in 589b48d

Padded values are masked during the attention calculation.

Technically, `n_kv` could be constant and equal to the maximum KV cache size (see llama.cpp/src/llama.cpp, Lines 17182 to 17189 in 589b48d). But this would make the inference sub-optimal, because we would be attending to too many unused KV cells, which will increase the…