I implemented KV caching from scratch in the nanoVLM repository (a small codebase for training your own vision language model in pure PyTorch). This gave us a ~38% speedup in generation. In this blog post I will cover KV caching and everything I learned while implementing it. The lessons are general and apply to any autoregressive language model, and implementing the technique from scratch in a small codebase is a great learning experience.
Introduction
Autoregressive language models generate text by sampling one token at a time. During inference, the model processes a given input sequence, predicts the next token, appends it to the sequence, and repeats until some stopping criterion is met.

This step-by-step generation is inherently sequential: to generate token $t_{i+1}$, the model must attend to the entire sequence from $t_0$ to $t_i$. Although transformers are internally parallel, every new prediction requires a full forward pass through all transformer layers, which results in compute and memory that grow quadratically with the sequence length.
This repetition also leads to computational redundancy. In this post, we explore KV caching, an optimization technique that reduces this inefficiency.
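To make this loop concrete, here is a minimal sketch of naive greedy decoding. The function name `generate_naive`, the `model` callable, and `eos_token_id` are placeholders I'm assuming for illustration (any module that maps a `(1, seq_len)` tensor of token ids to `(1, seq_len, vocab_size)` logits would work):

```python
import torch

def generate_naive(model, input_ids, max_new_tokens, eos_token_id):
    # input_ids: (1, seq_len) tensor of token ids
    for _ in range(max_new_tokens):
        logits = model(input_ids)                  # full forward pass over the *entire* sequence
        next_token = logits[:, -1].argmax(dim=-1)  # greedy choice for the next token
        input_ids = torch.cat([input_ids, next_token[:, None]], dim=1)
        if next_token.item() == eos_token_id:      # one possible stop criterion
            break
    return input_ids
```

Every iteration re-runs the model over the whole sequence generated so far; that is exactly the redundancy KV caching removes.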
Revisiting the Transformer Architecture
Before diving into caching, let's revisit how attention works in transformer-based language models. A transformer language model consists of stacked layers, each composed of multi-head self-attention, a feed-forward network (MLP), residual connections, and layer normalization.

To understand where KV caching can help, we will focus on the self-attention mechanism, specifically within a single attention head.

Let's walk through a simple PyTorch implementation to visualize the key computations.
```python
import torch

input_seq_length = 5
dim_model = 10

input_ids_emb = torch.randn(input_seq_length, dim_model)

# Projection matrices for queries, keys, and values
w_q = torch.randn(dim_model, dim_model)
w_k = torch.randn(dim_model, dim_model)
w_v = torch.randn(dim_model, dim_model)

q = input_ids_emb @ w_q
k = input_ids_emb @ w_k
v = input_ids_emb @ w_v
```
Self-attention computation
For a sequence of $T$ input embeddings, represented as $X \in \mathbb{R}^{T \times d}$, self-attention computes:

$Q = XW_Q$, with $W_Q \in \mathbb{R}^{d \times d_q}$
$K = XW_K$, with $W_K \in \mathbb{R}^{d \times d_k}$
$V = XW_V$, with $W_V \in \mathbb{R}^{d \times d_v}$

A causal mask $M$ prevents access to future tokens. The final output is:

$$\text{Attention}(X; Q, K, V) = \text{softmax}\left(\frac{QK^\top \cdot M}{\sqrt{d_k}}\right)V$$
Below is a minimal PyTorch equivalent using a causal mask.
```python
import torch.nn.functional as F

attention_scores = q @ k.T

# Lower-triangular causal mask: token i can only attend to tokens <= i
causal_mask = torch.tril(torch.ones(input_seq_length, input_seq_length))
masked_scores = attention_scores.masked_fill(causal_mask == 0, float('-inf'))

attention_weights = F.softmax(masked_scores, dim=-1)
output = attention_weights @ v
```
Where the redundancy creeps in
In autoregressive generation, the model produces one token at a time. At each step it recomputes $Q$, $K$, and $V$ for the entire sequence, even though the earlier tokens have not changed.
```python
new_token_emb = torch.randn(1, dim_model)
extended_input = torch.cat((input_ids_emb, new_token_emb), dim=0)

q_ext = extended_input @ w_q
k_ext = extended_input @ w_k
v_ext = extended_input @ w_v
```
To check redundancy:
```python
torch.testing.assert_close(k, k_ext[:input_seq_length])
torch.testing.assert_close(v, v_ext[:input_seq_length])
```
These checks confirm that, for every position except the newest token, $K$ and $V$ are identical to the previously computed values.
```
Original (5×5):          Extended (6×6):

■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
■ ■ ■ ■ ■                ■ ■ ■ ■ ■ □
                         □ □ □ □ □ □

■ = already calculated (and recomputed unnecessarily without a cache)
□ = genuinely new values for the added token
```
Most of the attention computation is repeated unnecessarily, and this becomes increasingly expensive as the sequence grows.
How KV caching fixes it
To eliminate this inefficiency, we use KV caching.
After processing the initial prompt, we cache the computed keys ($K$) and values ($V$) for each layer. During generation, we compute $K$ and $V$ only for the newly added token and append them to the cache. The query $Q$ is computed just for the current token and used together with the cached $K$ and $V$ to produce the attention output.

This turns generation from a full-sequence recomputation into a lightweight incremental update.
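Here is a rough sketch of that update on the toy example from the previous section (reusing `w_q`, `w_k`, `w_v`, `k`, `v`, `new_token_emb`, and `F` from the snippets above; single head, no scaling, to match the simplified code there):

```python
# Query only for the new token
q_new = new_token_emb @ w_q

# Append the new token's key and value to the cached k and v
k_cache = torch.cat((k, new_token_emb @ w_k), dim=0)
v_cache = torch.cat((v, new_token_emb @ w_v), dim=0)

# One row of attention instead of a full (seq_len+1) x (seq_len+1) matrix
scores = q_new @ k_cache.T              # shape: (1, seq_len + 1)
weights = F.softmax(scores, dim=-1)     # no causal mask needed for the last position
new_token_output = weights @ v_cache
```

Because the newest token is allowed to attend to every previous position, no mask is needed for this single row, and the result matches the last row of the full recomputation.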
In practice, the cache is a list with one dictionary per layer; each dictionary holds a "key" and a "value" tensor of shape (batch_size, num_heads, seq_len_cached, head_dim).
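As a quick illustration of that layout (the dimensions below are made up for the example, not nanoVLM's actual configuration):

```python
import torch

# Made-up dimensions, just to illustrate the cache layout
num_layers, batch_size, num_heads, head_dim = 2, 1, 4, 16
prompt_len = 5

# One dictionary per layer, holding the cached keys and values
kv_cache = [
    {
        "key":   torch.zeros(batch_size, num_heads, prompt_len, head_dim),
        "value": torch.zeros(batch_size, num_heads, prompt_len, head_dim),
    }
    for _ in range(num_layers)
]

# Appending one new token grows the cache along the sequence dimension
k_new = torch.zeros(batch_size, num_heads, 1, head_dim)
v_new = torch.zeros(batch_size, num_heads, 1, head_dim)
kv_cache[0]["key"]   = torch.cat([kv_cache[0]["key"], k_new], dim=2)
kv_cache[0]["value"] = torch.cat([kv_cache[0]["value"], v_new], dim=2)

print(kv_cache[0]["key"].shape)   # torch.Size([1, 4, 6, 16])
```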
This is the basis for how modern LLMs can efficiently generate long outputs.
KV Caching in nanoVLM: From Theory to Practice
Now that we understand the theory behind KV caching, let's look at how it is actually implemented in the nanoVLM repository. It is an ideal testbed, since it is a concise and self-contained codebase.
KV caching is enabled in three key components of the model: the attention block, which uses and updates the KV cache; the language model, which tracks the cache per layer; and the generation loop, which separates the prefill phase (the initial pass over the input prompt) from the sequential decode phase.
1. Updating the KV cache in the attention block
In the LanguageModelGroupedAttention class, the forward function is modified to accept and update a per-block key/value cache (block_kv_cache).

Previously, the model recomputed $K$ and $V$ at every generation step. Now we compute $K_{\text{new}}$ and $V_{\text{new}}$ only for the incoming tokens and append them to the cached values.
```python
def forward(self, x, cos, sin, attention_mask=None, block_kv_cache=None):
    is_prefill = block_kv_cache is None
    B, T_curr, C = x.size()

    # Project the current tokens to queries, keys, and values
    q_curr, k_curr, v_curr = project_current_tokens(x)
    q, k_rotated = apply_rotary_pos_embd(q_curr, k_curr, cos, sin)

    if not is_prefill and block_kv_cache['key'] is not None:
        # Decode step: append the new keys/values to the cache
        k = torch.cat((block_kv_cache['key'], k_rotated), dim=2)
        v = torch.cat((block_kv_cache['value'], v_curr), dim=2)
    else:
        # Prefill step: no cache yet, use the freshly computed keys/values
        k, v = k_rotated, v_curr

    block_kv_cache = {'key': k, 'value': v}

    # ... attention computation using q, k, v ...
    return attention_output, block_kv_cache
```
2. Tracking the cache across layers
The LanguageModel class introduces per-layer cache tracking. The start_pos argument lets the model compute the correct rotary positional encoding for newly generated tokens.
```python
def forward(self, x, kv_cache=None, start_pos=0):
    T_curr = x.size(1)

    # Rotary embeddings for the absolute positions of the current tokens,
    # offset by start_pos
    position_ids = torch.arange(start_pos, start_pos + T_curr, device=x.device)
    cos, sin = self.rotary_embd(position_ids)

    for i, block in enumerate(self.blocks):
        x, kv_cache[i] = block(x, cos, sin, attention_mask, kv_cache[i])

    return x, kv_cache
```

kv_cache: a list of dictionaries, one per transformer layer, holding the previously computed keys and values.
start_pos: keeps the rotary embeddings aligned with the current generation index.
3. Generation loop: prefill vs. decode
The biggest architectural change is in the generate() method of the VisionLanguageModel.
Generation is split into two phases:

Prefill phase: encode the full prompt to build the initial cache.
Decode phase: generate tokens one at a time, reusing and extending the cached keys/values.

[Diagram: in the prefill phase, the prompt (e.g. "what is") is passed through the transformer once, producing a K/V cache for every layer; in the decode phase, tokens are then generated one by one, with each step reusing that cache.]
The corresponding code is:
```python
# Prefill phase: process the full prompt once and build the cache
prompt_output, kv_cache_list = self.forward(
    inputs,
    kv_cache=None,
    start_pos=0,
)

# Decode phase: generate one token at a time, reusing the cache
for i in range(max_new_tokens):
    next_token = sample_from(prompt_output)
    decode_output, kv_cache_list = self.forward(
        next_token,
        kv_cache=kv_cache_list,
        start_pos=current_position,  # position of the new token in the full sequence
    )
    prompt_output = decode_output
```
By separating these phases, we avoid redundant computation and dramatically speed up inference, especially for long prompts.
Change Summary
| Module | Original behavior | New behavior |
| --- | --- | --- |
| LanguageModelGroupedAttention.forward | Recomputed $Q$, $K$, $V$ at every step | Uses and updates the KV cache |
| LanguageModel.forward | No memory of previous state | Tracks the per-layer KV cache and handles start_pos |
| VisionLanguageModel.generate | Single generation loop | Split into prefill and decode phases |
Summary: Why KV Caches Are Important
| Benefit | Description |
| --- | --- |
| Incremental growth | The cache grows by one column per new token |
| Position-aware decoding | start_pos keeps the rotary position encodings correct during decoding |
KV caching eliminates unnecessary computation during autoregressive generation, enabling faster and more efficient inference, especially for long sequences and real-time applications. It trades memory for speed, and its drawbacks are somewhat more complex code and constraints on fancier inference schemes such as beam search. KV caching is a popular way to speed up LLM inference and helps make it possible to run these models on consumer hardware.
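To get a feel for the memory side of that trade-off, here is a back-of-the-envelope estimate; the dimensions below are assumptions chosen only for illustration, not any particular model's configuration:

```python
# Rough KV cache size: 2 tensors (K and V) per layer, each of shape
# (batch_size, num_heads, seq_len, head_dim), stored in fp16 (2 bytes/element).
# All dimensions are illustrative assumptions.
num_layers = 30
num_heads = 16      # key/value heads
head_dim = 64
seq_len = 4096
batch_size = 1
bytes_per_elem = 2  # fp16

cache_bytes = 2 * num_layers * batch_size * num_heads * seq_len * head_dim * bytes_per_elem
print(f"{cache_bytes / 1024**3:.2f} GiB")  # ~0.47 GiB for these assumed dimensions
```

The cache grows linearly with sequence length and batch size, which is why long contexts and large batches are where the memory cost of KV caching becomes most noticeable.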