I implemented KV caching from scratch in the nanoVLM repository (a small codebase for training your own vision language model in pure PyTorch). This gave us a ~38% speedup in generation. In this blog post I will cover KV caching and everything I learned while implementing it. The lessons are general and apply to any autoregressive language model, and implementing the technique from scratch in a small codebase is a great learning experience.
Introduction
Autoregressive language models generate text by sampling one token at a time. During inference, the model processes a given input sequence, predicts the next token, appends it to the sequence, and repeats until some stopping criterion is met.

This step-by-step generation is inherently sequential: to generate token $t_{i+1}$, the model must attend to the entire sequence from $t_0$ to $t_i$. Although transformers are internally parallel, every new prediction requires a full forward pass through all transformer layers, which results in compute and memory that grow quadratically with the sequence length.
This repetition also leads to computational redundancy. In this post, we explore KV caching, an optimization technique that reduces this inefficiency.
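To make this loop concrete, here is a minimal sketch of naive greedy decoding. The function name `generate_naive`, the `model` callable, and `eos_token_id` are placeholders I'm assuming for illustration (any module that maps a `(1, seq_len)` tensor of token ids to `(1, seq_len, vocab_size)` logits would work):

```python
import torch

def generate_naive(model, input_ids, max_new_tokens, eos_token_id):
    # input_ids: (1, seq_len) tensor of token ids
    for _ in range(max_new_tokens):
        logits = model(input_ids)                  # full forward pass over the *entire* sequence
        next_token = logits[:, -1].argmax(dim=-1)  # greedy choice for the next token
        input_ids = torch.cat([input_ids, next_token[:, None]], dim=1)
        if next_token.item() == eos_token_id:      # one possible stop criterion
            break
    return input_ids
```

Every iteration re-runs the model over the whole sequence generated so far; that is exactly the redundancy KV caching removes.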
Revisiting the Transformer Architecture
Before diving into caching, let's revisit how attention works in transformer-based language models. A transformer language model consists of stacked layers, each composed of multi-head self-attention, a feed-forward network (MLP), residual connections, and layer normalization.

To understand where KV caching can help, we will focus on the self-attention mechanism, specifically within a single attention head.

Let's walk through a simple PyTorch implementation to visualize the key computations.
```python
import torch

input_seq_length = 5
dim_model = 10

input_ids_emb = torch.randn(input_seq_length, dim_model)

# Projection matrices for queries, keys, and values
w_q = torch.randn(dim_model, dim_model)
w_k = torch.randn(dim_model, dim_model)
w_v = torch.randn(dim_model, dim_model)

q = input_ids_emb @ w_q
k = input_ids_emb @ w_k
v = input_ids_emb @ w_v
```
Self-attention computation
For a sequence of $T$ input embeddings, represented as $X \in \mathbb{R}^{T \times d}$, self-attention computes:

$Q = XW_Q$, with $W_Q \in \mathbb{R}^{d \times d_q}$
$K = XW_K$, with $W_K \in \mathbb{R}^{d \times d_k}$
$V = XW_V$, with $W_V \in \mathbb{R}^{d \times d_v}$

A causal mask $M$ prevents access to future tokens. The final output is:

$$\text{Attention}(X; Q, K, V) = \text{softmax}\left(\frac{QK^\top \cdot M}{\sqrt{d_k}}\right)V$$
Below is a minimal PyTorch equivalent using a causal mask.
```python
import torch.nn.functional as F

attention_scores = q @ k.T

# Lower-triangular causal mask: token i can only attend to tokens <= i
causal_mask = torch.tril(torch.ones(input_seq_length, input_seq_length))
masked_scores = attention_scores.masked_fill(causal_mask == 0, float('-inf'))

attention_weights = F.softmax(masked_scores, dim=-1)
output = attention_weights @ v
```
Where the redundancy creeps in
In autoregressive generation, the model produces one token at a time. At each step it recomputes $Q$, $K$, and $V$ for the entire sequence, even though the earlier tokens have not changed.
```python
new_token_emb = torch.randn(1, dim_model)
extended_input = torch.cat((input_ids_emb, new_token_emb), dim=0)

q_ext = extended_input @ w_q
k_ext = extended_input @ w_k
v_ext = extended_input @ w_v
```
To check redundancy:
```python
torch.testing.assert_close(k, k_ext[:input_seq_length])
torch.testing.assert_close(v, v_ext[:input_seq_length])
```
These checks confirm that, for every position except the newest token, $K$ and $V$ are identical to the previously computed values.
```
Original (5×5):          Extended (6×6):

■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
                         □ □ □ □ □ □

■ = already calculated (and recomputed unnecessarily without a cache)
□ = genuinely new values for the added token
```
Most of the attention computation is repeated unnecessarily, and this becomes increasingly expensive as the sequence grows.
How KV caching fixes it
To eliminate this inefficiency, we use KV caching.
After processing the initial prompt, we cache the computed keys ($K$) and values ($V$) for each layer. During generation, we compute $K$ and $V$ only for the newly added token and append them to the cache. The query $Q$ is computed just for the current token and used together with the cached $K$ and $V$ to produce the attention output.

This turns generation from a full-sequence recomputation into a lightweight incremental update.
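Here is a rough sketch of that update on the toy example from the previous section (reusing `w_q`, `w_k`, `w_v`, `k`, `v`, `new_token_emb`, and `F` from the snippets above; single head, no scaling, to match the simplified code there):

```python
# Query only for the new token
q_new = new_token_emb @ w_q

# Append the new token's key and value to the cached k and v
k_cache = torch.cat((k, new_token_emb @ w_k), dim=0)
v_cache = torch.cat((v, new_token_emb @ w_v), dim=0)

# One row of attention instead of a full (seq_len+1) x (seq_len+1) matrix
scores = q_new @ k_cache.T              # shape: (1, seq_len + 1)
weights = F.softmax(scores, dim=-1)     # no causal mask needed for the last position
new_token_output = weights @ v_cache
```

Because the newest token is allowed to attend to every previous position, no mask is needed for this single row, and the result matches the last row of the full recomputation.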
In practice, the cache is a list with one dictionary per layer; each dictionary holds a "key" and a "value" tensor of shape (batch_size, num_heads, seq_len_cached, head_dim).
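As a quick illustration of that layout (the dimensions below are made up for the example, not nanoVLM's actual configuration):

```python
import torch

# Made-up dimensions, just to illustrate the cache layout
num_layers, batch_size, num_heads, head_dim = 2, 1, 4, 16
prompt_len = 5

# One dictionary per layer, holding the cached keys and values
kv_cache = [
    {
        "key":   torch.zeros(batch_size, num_heads, prompt_len, head_dim),
        "value": torch.zeros(batch_size, num_heads, prompt_len, head_dim),
    }
    for _ in range(num_layers)
]

# Appending one new token grows the cache along the sequence dimension
k_new = torch.zeros(batch_size, num_heads, 1, head_dim)
v_new = torch.zeros(batch_size, num_heads, 1, head_dim)
kv_cache[0]["key"]   = torch.cat([kv_cache[0]["key"], k_new], dim=2)
kv_cache[0]["value"] = torch.cat([kv_cache[0]["value"], v_new], dim=2)

print(kv_cache[0]["key"].shape)   # torch.Size([1, 4, 6, 16])
```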
This is the basis for how modern LLMs can efficiently generate long outputs.
KV Caching in nanoVLM: From Theory to Practice
Now that we understand the theory behind KV caching, let's look at how it is actually implemented in the nanoVLM repository. It is an ideal testbed, since it is a concise and self-contained codebase.
KV caching is enabled in three key components of the model: the attention block, which uses and updates the KV cache; the language model, which tracks the cache per layer; and the generation loop, which separates the prefill phase (the initial pass over the input prompt) from the sequential decode phase.
1. Updating the KV cache in the attention block
In the LanguageModelGroupedAttention class, the forward function is modified to accept and update a per-block key/value cache (block_kv_cache).

Previously, the model recomputed $K$ and $V$ at every generation step. Now we compute $K_{\text{new}}$ and $V_{\text{new}}$ only for the incoming tokens and append them to the cached values.
```python
def forward(self, x, cos, sin, attention_mask=None, block_kv_cache=None):
    is_prefill = block_kv_cache is None
    B, T_curr, C = x.size()

    # Project the current tokens to queries, keys, and values
    q_curr, k_curr, v_curr = project_current_tokens(x)
    q, k_rotated = apply_rotary_pos_embd(q_curr, k_curr, cos, sin)

    if not is_prefill and block_kv_cache['key'] is not None:
        # Decode step: append the new keys/values to the cache
        k = torch.cat((block_kv_cache['key'], k_rotated), dim=2)
        v = torch.cat((block_kv_cache['value'], v_curr), dim=2)
    else:
        # Prefill step: no cache yet, use the freshly computed keys/values
        k, v = k_rotated, v_curr

    block_kv_cache = {'key': k, 'value': v}

    # ... attention computation using q, k, v ...
    return attention_output, block_kv_cache
```
2. Tracking the cache across layers
The LanguageModel class introduces per-layer cache tracking. The start_pos argument lets the model compute the correct rotary positional encoding for newly generated tokens.
```python
def forward(self, x, kv_cache=None, start_pos=0):
    T_curr = x.size(1)

    # Rotary embeddings for the absolute positions of the current tokens,
    # offset by start_pos
    position_ids = torch.arange(start_pos, start_pos + T_curr, device=x.device)
    cos, sin = self.rotary_embd(position_ids)

    for i, block in enumerate(self.blocks):
        x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])

    return x, kv_cache
```

kv_cache: a list of dictionaries, one per transformer layer, holding the previously computed keys and values.
start_pos: keeps the rotary embeddings aligned with the current generation index.
3. Generation loop: prefill vs. decode
The biggest architectural change is in the generate() method of the VisionLanguageModel.
Generation is split into two phases:

Prefill phase: encode the full prompt to build the initial cache.
Decode phase: generate tokens one at a time, reusing and extending the cached keys/values.

[Diagram: in the prefill phase, the prompt (e.g. "what is") is passed through the transformer once, producing a K/V cache for every layer; in the decode phase, tokens are then generated one by one, with each step reusing that cache.]
The corresponding code is:
```python
# Prefill phase: process the full prompt once and build the cache
prompt_output, kv_cache_list = self.forward(
    inputs,
    kv_cache=None,
    start_pos=0,
)

# Decode phase: generate one token at a time, reusing the cache
for i in range(max_new_tokens):
    next_token = sample_from(prompt_output)
    decode_output, kv_cache_list = self.forward(
        next_token,
        kv_cache=kv_cache_list,
        start_pos=current_position,  # position of the new token in the full sequence
    )
    prompt_output = decode_output
```
By separating these phases, we avoid redundant computation and dramatically speed up inference, especially for long prompts.
Change Summary
| Module | Original behavior | New behavior |
| --- | --- | --- |
| LanguageModelGroupedAttention.forward | Recomputed $Q$, $K$, $V$ at every step | Uses and updates the KV cache |
| LanguageModel.forward | No memory of previous state | Tracks the per-layer KV cache and handles start_pos |
| VisionLanguageModel.generate | Single generation loop | Split into prefill and decode phases |
Summary: Why KV Caches Are Important
| Benefit | Description |
| --- | --- |
| Incremental growth | The cache grows by one column per new token |
| Position-aware decoding | start_pos keeps the rotary position encodings correct during decoding |
KV caching eliminates unnecessary computation during autoregressive generation, enabling faster and more efficient inference, especially for long sequences and real-time applications. It trades memory for speed, and its drawbacks are somewhat more complex code and constraints on fancier inference schemes such as beam search. KV caching is a popular way to speed up LLM inference and helps make it possible to run these models on consumer hardware.
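To get a feel for the memory side of that trade-off, here is a back-of-the-envelope estimate; the dimensions below are assumptions chosen only for illustration, not any particular model's configuration:

```python
# Rough KV cache size: 2 tensors (K and V) per layer, each of shape
# (batch_size, num_heads, seq_len, head_dim), stored in fp16 (2 bytes/element).
# All dimensions are illustrative assumptions.
num_layers = 30
num_heads = 16      # key/value heads
head_dim = 64
seq_len = 4096
batch_size = 1
bytes_per_elem = 2  # fp16

cache_bytes = 2 * num_layers * batch_size * num_heads * seq_len * head_dim * bytes_per_elem
print(f"{cache_bytes / 1024**3:.2f} GiB")  # ~0.47 GiB for these assumed dimensions
```

The cache grows linearly with sequence length and batch size, which is why long contexts and large batches are where the memory cost of KV caching becomes most noticeable.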