-
In the ...
```c
// split cached k into n_head heads
struct ggml_tensor * k =
    ggml_view_3d(ctx, kv.k_l[il],
            n_embd_head_k, n_kv, n_head_kv,
            ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
            ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
            0);
cb(k, "k", il);

// split cached v into n_head heads (note: v is stored transposed in the cache)
struct ggml_tensor * v =
    ggml_view_3d(ctx, kv.v_l[il],
            n_kv, n_embd_head_v, n_head_kv,
            ggml_element_size(kv.v_l[il])*n_ctx,
            ggml_element_size(kv.v_l[il])*n_ctx*n_embd_head_v,
            0);
cb(v, "v", il);

struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);
cb(kqv, "kqv", il);

struct ggml_tensor * kqv_merged = ggml_permute(ctx, kqv, 0, 2, 1, 3);
cb(kqv_merged, "kqv_merged", il);
```
... There are two parameters that I am concerned about: ...

Can anyone help with the above two questions? Thanks a lot.
-
`n_kv` indeed grows gradually, but in chunks of 32 or 256 (determined by `llama_kv_cache_get_padding()`):

llama.cpp/src/llama.cpp
Lines 17186 to 17187 in 589b48d

Padded values are masked during the attention calculation.

Technically, `n_kv` could be constant and equal to the maximum KV cache size (see llama.cpp/src/llama.cpp, Lines 17182 to 17189 in 589b48d). But this would make the inference sub-optimal, because we would be attending to too many unused KV cells, which will increase the…