DeepSeek released V4 today. Two MoE checkpoints are on the hub: DeepSeek-V4-Pro (1.6T total parameters, 49B active) and DeepSeek-V4-Flash (284B total, 13B active). Both have a 1M-token context window. Benchmark numbers are competitive but not SOTA, and that doesn't matter. The real story is that V4 is designed to support long context lengths efficiently, which makes it one of the strongest open candidates for agent tasks.
The focus is long-running agent workloads. Today, running a frontier open model as an agent fails in predictable ways: the model stalls, you re-prompt, the trace blows past the context budget, the KV cache fills the GPU, and tool-call round trips degrade in the middle of long tasks. V4 is built to remove these known roadblocks and point a direction for the community to follow.
This post covers three things: the architecture changes that make long-context inference cheaper, the agent-specific post-training decisions layered on top, and a few points from the paper to help you reason about both.
Agent KV cache issue
A 1M context window is capacity, not performance. Whether you can actually use it depends on the cost of every forward pass at that depth. For agents running long tool-use trajectories (SWE-bench tasks, multi-step browsing sessions, terminal sessions with hundreds of commands), every tool result is appended to the context, and every subsequent token pays the full attention cost for everything that came before.
Two numbers matter: FLOPs per generated token and KV cache size. Both grow with sequence length. At 1 million tokens, DeepSeek-V4-Pro needs 27% of DeepSeek-V3.2's single-token inference FLOPs, so it runs faster on the same hardware, and it uses 10% of the KV cache memory. V4-Flash pushes both further, down to 10% of the FLOPs and 7% of the KV cache.
Compared to an established baseline such as 8-head grouped-query attention with the cache stored in plain bfloat16, DeepSeek V4 needs roughly 2% of the cache size. That makes serving long-context workloads at very large scale dramatically easier.
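To build intuition for where numbers like these come from, here is some back-of-envelope cache arithmetic. The head count and head dimension below are illustrative assumptions (the post does not give V4's real dimensions); only the compression ratios (4x, 128x), the storage dtypes, and the GQA-bf16 baseline come from the text.

```python
# Rough KV cache arithmetic. n_kv_heads and head_dim are assumed values,
# chosen only to make the ratios concrete.

def kv_cache_bytes(seq_len, n_kv_heads, head_dim, bytes_per_elem, compression=1):
    """Bytes to store K and V for one layer at a given sequence length."""
    stored_tokens = seq_len // compression
    return 2 * stored_tokens * n_kv_heads * head_dim * bytes_per_elem

SEQ = 1_000_000

# Baseline: 8-head GQA, bf16 (2 bytes/element), no compression.
gqa = kv_cache_bytes(SEQ, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

# CSA-style path: 4x sequence compression, FP8 storage (1 byte/element).
csa = kv_cache_bytes(SEQ, n_kv_heads=8, head_dim=128, bytes_per_elem=1, compression=4)

# HCA-style path: 128x compression, FP8 storage.
hca = kv_cache_bytes(SEQ, n_kv_heads=8, head_dim=128, bytes_per_elem=1, compression=128)

print(f"GQA bf16:     {gqa / 2**30:.2f} GiB per layer")
print(f"CSA 4x FP8:   {csa / 2**30:.3f} GiB per layer ({csa / gqa:.1%} of GQA)")
print(f"HCA 128x FP8: {hca / 2**30:.4f} GiB per layer ({hca / gqa:.2%} of GQA)")
```

The point of the sketch is that halving precision and compressing the sequence multiply: FP8 alone buys 2x, 4x pooling buys another 4x, and the 128x path shrinks the cache to well under a percent of the baseline before the per-layer mix is even considered.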
Figure 1: Benchmark comparison (left), per-token FLOP and accumulated KV cache (right) against sequence length.
Hybrid Attention: CSA and HCA
The efficiency comes from splitting attention into two mechanisms and interleaving them across layers.
Compressed Sparse Attention (CSA) uses softmax-gated pooling with a learned positional bias to compress KV entries by a factor of 4 along the sequence dimension. A lightning indexer (FP4, multi-head dot product with ReLU scores) then selects the top-k compressed blocks for each query. The indexer inherits its design from the lightning indexer in DeepSeek-V3.2's sparse attention.
Figure 3: CSA. The compressor collapses every four tokens into one compressed KV entry. The Lightning indexer selects the top k compressed blocks for each query. The sliding window branch processes the latest uncompressed token.
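A toy version of the CSA token path makes the mechanics concrete. Mean pooling stands in for the learned softmax-gate compressor, and a scalar ReLU dot product stands in for the multi-head FP4 indexer; block size 4 and the top-k selection follow the description above, everything else is a simplification.

```python
# Toy sketch of the CSA path: compress 4 tokens -> 1 KV entry, score blocks
# with an indexer stand-in, keep only the top-k blocks per query.
import math

def compress_kv(keys, block=4):
    """Mean-pool every `block` key vectors into one compressed entry
    (a stand-in for the learned softmax-gate pooling)."""
    out = []
    for i in range(0, len(keys) - len(keys) % block, block):
        chunk = keys[i:i + block]
        out.append([sum(dim) / block for dim in zip(*chunk)])
    return out

def indexer_scores(query, blocks):
    """Lightning-indexer stand-in: ReLU of a dot product per compressed block."""
    return [max(0.0, sum(q * b for q, b in zip(query, blk))) for blk in blocks]

def select_topk(scores, k):
    """Indices of the k highest-scoring compressed blocks for this query."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# 16 toy key vectors of dim 2 -> 4 compressed blocks; attend to the top 2.
keys = [[math.sin(i), math.cos(i)] for i in range(16)]
blocks = compress_kv(keys, block=4)
query = [1.0, 0.5]
picked = select_topk(indexer_scores(query, blocks), k=2)
print("attend to compressed blocks:", picked)
```

The real model would then run full attention only over the selected blocks plus the sliding-window branch of recent uncompressed tokens.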
Highly Compressed Attention (HCA) compresses KV entries by a factor of 128 and drops the sparse selection entirely: every query attends densely to all compressed blocks. The compressed sequence is short enough that dense attention is cheap.
Figure 4: HCA. A heavier compressor (128x vs. 4x) is followed by dense attention over the compressed stream, with the same sliding-window branch for recency.
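Why dense attention over the 128x stream is affordable is simple arithmetic. Counting only score computations per query, with an assumed top-k of 2048 for the CSA path (the post does not give k):

```python
# Back-of-envelope attention cost at 1M tokens, per query, counting only
# query-key score computations. The CSA top-k of 2048 is an assumption.
SEQ = 1_000_000

dense_full = SEQ          # vanilla attention: score against every token
hca_dense = SEQ // 128    # HCA: dense, but over the 128x-compressed stream
csa_sparse = 2048         # CSA: only the indexer-selected compressed blocks

print(f"full attention: {dense_full:>9,} scores/query")
print(f"HCA compressed: {hca_dense:>9,} scores/query")
print(f"CSA top-k:      {csa_sparse:>9,} scores/query")
```

At 1M tokens the HCA stream is only a few thousand entries, so "dense" attention there costs about as much as vanilla attention over an 8K context.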
The layers alternate between CSA and HCA. Different layers learn different attention patterns, and forcing one mechanism on all of them wastes capacity. In V4-Pro's 61-layer stack, layers 0-1 are HCA, layers 2-60 alternate between CSA and HCA, and the final MTP block uses only sliding-window attention.
Both paths store most KV entries in FP8, keeping only the RoPE dimensions in BF16, and the lightning indexer inside CSA runs in FP4. These storage choices and compression ratios combine to produce the 2% KV cache figure.
Figure 2: Overall architecture. Attention layers alternate between CSA and HCA, feed-forward layers use DeepSeekMoE, and residual connections are replaced by manifold-constrained hyperconnections (mHC).
Changes for agents
Efficient long-context attention is necessary for agent workflows, but not sufficient. The paper describes three post-training and infrastructure decisions that directly target agent use cases.
Thoughts interleaved across tool calls
In V3.2, the reasoning trace persisted across tool-result rounds but was discarded whenever a new user message arrived. For agents handling a single user turn, that was fine. For multi-turn agent workflows, where the user sends a follow-up after the agent has already chained several tool calls, the model lost its accumulated reasoning and had to rebuild that state from scratch.
In V4, reasoning is preserved across user-message boundaries whenever the conversation includes tool calls. The model keeps its full reasoning history across all rounds, user turns included, giving it one continuous, cumulative chain of thought over the life of an agent task. In conversations without tools, the old behavior remains: reasoning is flushed after each turn to keep the context lean.
Figure 7: Thinking with tools (top) maintains reasoning across all turns. Thinking without tools (bottom) discards the reasoning every time a new user message arrives.
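The retention rule is easy to express as a context-assembly function. The message schema below (a "reasoning" role) is hypothetical, invented purely to illustrate the rule; the real chat template is not described in the post.

```python
# Sketch of the context-assembly rule: keep reasoning across turns when the
# conversation uses tools, otherwise flush it at each new user message.
# The "reasoning" role and message shapes here are invented for illustration.

def assemble_context(messages, tools_in_use):
    context = []
    for msg in messages:
        if msg["role"] == "user" and not tools_in_use:
            # No tools: drop all earlier reasoning before this new turn,
            # which is the V3.2-style behavior.
            context = [m for m in context if m["role"] != "reasoning"]
        context.append(msg)
    return context

history = [
    {"role": "user", "content": "fix the failing test"},
    {"role": "reasoning", "content": "the fixture path looks wrong"},
    {"role": "tool_call", "content": "read_file(tests/conftest.py)"},
    {"role": "user", "content": "also update the docs"},
]

with_tools = assemble_context(history, tools_in_use=True)
without = assemble_context(history, tools_in_use=False)
print(len(with_tools), len(without))  # reasoning kept vs dropped
```

In the tool-using case the second user message arrives with the earlier reasoning still in context; without tools, that reasoning is gone and the model starts the turn cold.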
Tool invocation schema using dedicated tokens
V4 introduces |DSML|, a tool invocation format built on dedicated special tokens and XML-style markup. The XML form reduces escaping failures compared to JSON-in-string tool calls, a common error mode when models emit nested quoted content.
The schema separates string parameters (passed through verbatim with string="true") from structured parameters (passed as JSON with string="false"). This eliminates a whole class of parsing errors for numbers and booleans.
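To see why the split helps, here is a parse of a DSML-style call. The tag names (`invoke`, `param`) are guesses at the shape, not the published schema; only the string="true"/"false" distinction comes from the post.

```python
# Illustrative parse of a DSML-style tool call. Tag names are invented;
# the string="true"/"false" split is the part described in the post.
import json
import xml.etree.ElementTree as ET

call = """
<invoke name="run_query">
  <param name="sql" string="true">SELECT * FROM jobs WHERE state = 'failed'</param>
  <param name="limit" string="false">25</param>
  <param name="dry_run" string="false">false</param>
</invoke>
"""

def parse_call(xml_text):
    root = ET.fromstring(xml_text)
    args = {}
    for p in root:
        if p.get("string") == "true":
            args[p.get("name")] = p.text              # passed through verbatim
        else:
            args[p.get("name")] = json.loads(p.text)  # typed: int, bool, ...
    return root.get("name"), args

name, args = parse_call(call)
print(name, args)
```

Note that the quotes inside the SQL need no escaping because the string parameter is never re-quoted as JSON, while `limit` and `dry_run` come out as a real int and bool rather than strings.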
DSec: A sandbox built for RL rollouts
Agent behavior was trained with RL against real tool environments, and the paper describes the sandbox infrastructure built for that purpose. DeepSeek Elastic Compute (DSec) is a Rust platform that exposes four execution substrates behind a single Python SDK: function calls, containers, microVMs (Firecracker), and full VMs (QEMU). Hundreds of thousands of sandboxes run concurrently on a single cluster.
Three DSec features matter for agent training: fast image loading via layered 3FS storage (RL rollouts don't wait on container startup), preemption-safe trajectory replay (interrupted training steps resume without rerunning tool calls), and a uniform API across substrates (training can target anything from function calls to full VMs without a rewrite). These infrastructure decisions underwrite the agent benchmark scores.
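DSec's actual Python SDK is not public, so the following is purely a sketch of what a uniform API across substrates could look like; every class, method, and string below is invented for illustration.

```python
# Hypothetical shape of a uniform sandbox SDK. None of these names come
# from DSec; the point is only that rollout code stays substrate-agnostic.
from dataclasses import dataclass

@dataclass
class Sandbox:
    substrate: str  # "function" | "container" | "microvm" | "vm"

    def run(self, cmd):
        # Stand-in: a real SDK would dispatch to the chosen substrate.
        return {"substrate": self.substrate, "cmd": cmd, "exit_code": 0}

def rollout(substrate, commands):
    """The same trajectory code runs regardless of substrate, which is the
    point of a uniform API: swap 'function' for 'vm' without rewriting
    the training loop."""
    box = Sandbox(substrate)
    return [box.run(c) for c in commands]

light = rollout("function", ["pytest -x"])
heavy = rollout("vm", ["apt-get install -y build-essential", "make test"])
print(light[0]["substrate"], heavy[0]["substrate"])
```

A cheap function-call sandbox and a full VM present the same interface, so the RL harness picks the substrate per task without touching trajectory code.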
Agent benchmark results
The knowledge and reasoning numbers are competitive but not at the top. The agent numbers are where V4-Pro-Max separates from the field.
Specific numbers in the agent section of Table 6:
Terminal-Bench 2.0: V4-Pro-Max scores 67.9, ahead of GLM-5.1 (63.5) and K2.6 (66.7), behind GPT-5.4-xHigh (75.1) and Gemini-3.1-Pro (68.5). SWE-bench Verified: 80.6 resolved, within a fraction of a point of Opus-4.6-Max (80.8) and tied with Gemini-3.1-Pro (80.6). MCPAtlas Public: 73.6, second only to Opus-4.6-Max (73.8). Turathlon: 51.8, ahead of K2.6 (50.0), GLM-5.1 (40.7), and Gemini-3.1-Pro (48.8).
On the paper's internal R&D coding benchmark, V4-Pro-Max reached a 67% pass rate on 30 curated tasks spanning PyTorch, CUDA, Rust, and C++, versus 47% for Sonnet 4.5 and 70% for Opus 4.5. In a survey of 85 DeepSeek developers using V4-Pro as their daily driver, 52% said they were ready to replace their current primary coding model with it, and another 39% were leaning toward yes.
Figure 9 shows the long-context retrieval numbers. MRCR 8-needle accuracy stays above 0.82 out to 256K tokens and is 0.59 at 1M.
Figure 9: MRCR 8 needle recovery. V4-Pro-Max stays above 0.82 up to 256K and 0.59 at 1M.
Using the model
Four checkpoints are on the hub. The instruct models store MoE expert weights in FP4 and everything else in FP8; the base models are FP8 throughout.
Both instruct models support three inference modes: no-thinking (fast, no chain of thought), thinking (explicit reasoning inside a dedicated block), and think-max (maximum reasoning effort with a dedicated system prompt). Think-max requires a context window of at least 384K tokens. The recommended sampling parameters for all modes are temperature = 1.0 and top_p = 1.0.
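One way to encode those constraints client-side is a small mode table. The mode keys and the `reasoning`/`min_context` fields are assumptions about how a serving layer might surface this; only temperature = 1.0, top_p = 1.0, and the 384K think-max requirement come from the post.

```python
# Mode table for the three inference modes. Field names are assumptions;
# the sampling values and the 384K floor for think-max are from the report.
MODES = {
    "no-thinking": {"reasoning": False},
    "thinking": {"reasoning": True},
    "think-max": {"reasoning": True, "min_context": 384_000},
}

def request_config(mode, context_window):
    cfg = dict(MODES[mode])
    if cfg.get("min_context", 0) > context_window:
        raise ValueError(f"{mode} needs >= {cfg['min_context']} context tokens")
    # Recommended sampling for all modes, per the report.
    cfg.update(temperature=1.0, top_p=1.0)
    return cfg

print(request_config("thinking", context_window=128_000))
```

The guard matters in practice: a harness that silently runs think-max against a 128K deployment will truncate mid-reasoning rather than fail fast.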
V4-Pro's numbers on SWE-bench Verified, MCPAtlas, and the internal R&D benchmark put it alongside closed frontier models on agent tasks. The open questions are how community tool harnesses will adapt to the |DSML| schema, and whether interleaved thinking transfers to agent frameworks outside DeepSeek's own.
The diagrams in this blog post are from the DeepSeek_V4.pdf technical report.