
1 million tokens of real-world context for agents to use

By versatileai · April 24, 2026 · 8 Mins Read

DeepSeek released V4 today. Two MoE checkpoints are on the Hub: DeepSeek-V4-Pro (1.6T total parameters, 49B active) and DeepSeek-V4-Flash (284B total, 13B active). Both have a 1M-token context window. Benchmark numbers are competitive but not SOTA, and that doesn't matter. The real innovation is that DeepSeek-V4 is designed to serve long contexts efficiently, which makes it one of the best candidates for agent tasks.

The focus is long-running agent workloads. Today, running a frontier open model as an agent fails in predictable ways: the trace exceeds the context budget, the KV cache fills up the GPU, tool-call round trips degrade in the middle of long tasks, the model stalls, and you re-prompt. V4 is built to remove these known roadblocks and point a way for the community to follow.

This post covers three things: the architectural changes that make long-context inference cheaper, the agent-specific post-training decisions layered on top, and some numbers from the paper to help you reason about both.

Agent KV cache issue

A 1M context window is just capacity, not performance. Whether you can actually use it depends on the cost of every forward pass at that depth. For agents running long tool-use trajectories (SWE-bench tasks, multi-step browsing sessions, terminal sessions with hundreds of commands), every tool result is appended to the context, and every subsequent token pays the full attention cost over everything that came before.

Two numbers matter: FLOPs per generated token and KV cache size. Both grow with sequence length. At 1M tokens, DeepSeek-V4-Pro needs 27% of the per-token inference FLOPs of DeepSeek-V3.2, so it runs faster on the same hardware, and it uses 10% of the KV cache memory. V4-Flash cuts further still, to 10% of the FLOPs and 7% of the KV cache.

Compared to an established baseline such as 8-head grouped-query attention stored in plain bfloat16, DeepSeek-V4 needs roughly 2% of the KV cache size. That makes very-long-context serving far easier to deploy at scale.
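A back-of-envelope sketch shows how compression ratio and storage precision multiply into numbers of this magnitude. Everything here is an illustrative assumption (layer count, head dimension, layer split), not figures from the paper, so it will not reproduce the 2% exactly:

```python
# Rough KV cache arithmetic. All layer counts, head dims, and the CSA/HCA
# layer split below are illustrative assumptions, not values from the paper.

def kv_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_elem, compression=1):
    """Bytes to cache K and V for one sequence."""
    return 2 * layers * kv_heads * head_dim * (seq_len // compression) * bytes_per_elem

seq = 1_000_000
# Baseline: 8-head GQA in bf16 (2 bytes/elem), 61 layers, head_dim 128 (assumed).
baseline = kv_bytes(seq, layers=61, kv_heads=8, head_dim=128, bytes_per_elem=2)
# Hypothetical compressed scheme: FP8 storage (1 byte/elem) with 4x sequence
# compression on roughly half the layers and 128x on the rest.
csa = kv_bytes(seq, layers=31, kv_heads=8, head_dim=128, bytes_per_elem=1, compression=4)
hca = kv_bytes(seq, layers=30, kv_heads=8, head_dim=128, bytes_per_elem=1, compression=128)
compressed = csa + hca

print(f"baseline:   {baseline / 2**30:.1f} GiB")
print(f"compressed: {compressed / 2**30:.1f} GiB ({100 * compressed / baseline:.1f}%)")
```

Even with these made-up numbers, halving the precision and compressing the sequence 4x-128x drops a 1M-token cache from hundreds of GiB to a size that fits comfortably on a single node.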

Figure 1: Benchmark comparison (left), per-token FLOPs and accumulated KV cache against sequence length (right). From the DeepSeek-V4 technical report.

Hybrid Attention: CSA and HCA

Efficiency is improved by splitting the attention into two mechanisms and interleaving them between layers.

Compressed Sparse Attention (CSA) uses a softmax-gated pooling compressor with learned positional bias to compress KV entries 4x along the sequence dimension. A lightning indexer (FP4, multi-head dot product with ReLU scores) then selects the top-k compressed blocks for each query. It inherits the sparse-selection idea from DeepSeek Sparse Attention in V3.2, but operates on blocks that are already 4x shorter than the original sequence, so the indexer's search space shrinks accordingly.

Figure 3: CSA. The compressor collapses every four tokens into one compressed KV entry, the lightning indexer selects the top-k compressed blocks for each query, and a sliding-window branch handles the latest uncompressed tokens.

Highly Compressed Attention (HCA) compresses KV entries 128x and drops sparse selection entirely: every query attends densely to every compressed block. The compressed sequence is short enough that dense attention is cheap.

Figure 4: HCA. A heavier compressor (128x vs. 4x) feeds dense attention over the compressed stream, with the same sliding-window branch for recency.

Layers alternate between CSA and HCA. Different layers learn different attention patterns, and forcing one mechanism on every layer wastes capacity. In V4-Pro's 61-layer stack, layers 0-1 are HCA, layers 2-60 alternate between CSA and HCA, and the final MTP block runs sliding-window attention only.
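The schedule above is easy to write down explicitly. One detail is an assumption here: the post does not say whether the alternating run starts with CSA or HCA at layer 2, so this sketch picks CSA:

```python
# Sketch of the V4-Pro attention schedule described above: layers 0-1 HCA,
# layers 2-60 alternating. Starting the alternation with CSA is an assumption.

def attention_schedule(num_layers=61):
    schedule = []
    for layer in range(num_layers):
        if layer < 2:
            schedule.append("HCA")          # first two layers are always HCA
        else:
            schedule.append("CSA" if (layer - 2) % 2 == 0 else "HCA")
    return schedule

sched = attention_schedule()
print(sched[:6])   # ['HCA', 'HCA', 'CSA', 'HCA', 'CSA', 'HCA']
print(len(sched))  # 61
```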

Both paths store most KV entries in FP8, keeping BF16 only for the RoPE dimensions. The lightning indexer inside CSA runs in FP4. These storage choices multiply with the compression ratios to yield the 2% KV cache number.

Figure 2: Overall architecture. The attention layers alternate between CSA and HCA, the feedforward layers use DeepSeekMoE, and the residual connections are replaced by manifold-constrained hyperconnections (mHC).

Changes for agents

Agent workflows require efficient long-context attention, but efficiency alone is not sufficient. The paper describes three post-training and infrastructure choices that directly target agent use cases.

Thoughts interleaved across tool calls

In V3.2, the reasoning trace persisted across tool-result rounds but was discarded whenever a new user message arrived. For agents handling a single user turn, this was fine. For multi-turn agent workflows, where the user sends a follow-up after the agent has already chained multiple tool calls, the model lost its accumulated reasoning and had to rebuild state from scratch.

In V4, reasoning is preserved across user-message boundaries whenever the conversation includes tool calls. The model keeps its full reasoning history across all rounds, including past user turns, which gives it a consistent, cumulative chain of thought over a long agent task. In conversations without tools, the old behavior remains: reasoning is flushed after each turn to keep the context concise.
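The retention rule reduces to a small predicate over the chat history. This sketch uses a generic message list; the role names and the idea of representing reasoning as messages are assumptions for illustration, not DeepSeek's actual chat template:

```python
# Illustrative sketch of the retention rule described above. Message shape
# and role names ("reasoning", "tool") are assumptions, not DeepSeek's API.

def prune_reasoning(history):
    """Keep reasoning blocks across turns if the conversation uses tools;
    otherwise flush them, matching the no-tools behavior described above."""
    has_tool_calls = any(m["role"] == "tool" for m in history)
    if has_tool_calls:
        return history
    return [m for m in history if m["role"] != "reasoning"]

tool_convo = [
    {"role": "user", "content": "fix the failing test"},
    {"role": "reasoning", "content": "the bug is in parse()"},
    {"role": "tool", "content": "pytest output ..."},
    {"role": "user", "content": "also run the linter"},
]
chat_convo = [
    {"role": "user", "content": "hi"},
    {"role": "reasoning", "content": "greet back"},
    {"role": "assistant", "content": "hello!"},
]

print(len(prune_reasoning(tool_convo)))   # 4: reasoning preserved
print(len(prune_reasoning(chat_convo)))   # 2: reasoning flushed
```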

Figure 7: Thinking with tools (top) preserves reasoning across turns; thinking without tools (bottom) discards it at each new user message.

Tool invocation schema using dedicated tokens

V4 introduces DSML, a tool-invocation format built on dedicated special tokens and XML. The XML format reduces escaping failures compared to JSON-in-string tool calls, a common error mode when models output nested quoted content.

The schema separates string parameters (passed as-is with string="true") from structured parameters (passed as JSON with string="false"). This eliminates a class of parsing errors for numbers and booleans that routinely occur with JSON tool-call formats.
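The string/structured split can be illustrated with generic XML. The actual DSML tag names and special tokens are not shown in this post, so everything below (the `tool_call` and `param` elements included) is a hypothetical rendering of the same idea:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical rendering of the string/structured parameter split described
# above. Tag names are invented for illustration; only the string="true|false"
# distinction follows the post.

def render_call(tool, params):
    call = ET.Element("tool_call", name=tool)
    for key, value in params.items():
        is_string = isinstance(value, str)
        p = ET.SubElement(call, "param", name=key,
                          string="true" if is_string else "false")
        # Strings pass through verbatim (no escaping into a JSON string);
        # everything else is serialized as JSON, so types survive round-trip.
        p.text = value if is_string else json.dumps(value)
    return ET.tostring(call, encoding="unicode")

xml_call = render_call("run_tests", {"path": "tests/", "timeout": 30, "verbose": True})
print(xml_call)
```

Because `timeout` and `verbose` carry `string="false"`, a parser knows to run them through a JSON decoder and recover an int and a bool rather than the strings "30" and "true".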

DSec: A sandbox built for RL rollouts

Agent behavior was trained with RL against real tool environments, and the paper describes the sandbox infrastructure built for that purpose. DeepSeek Elastic Compute (DSec) is a Rust platform that exposes four execution substrates behind a single Python SDK: function calls, containers, microVMs (Firecracker), and full VMs (QEMU). Hundreds of thousands of sandboxes run concurrently on a single cluster.

Three DSec features matter for agent training: fast image loading via layered 3FS storage (RL rollouts do not wait on container startup), preemption-safe trajectory replay (interrupted training steps resume without rerunning tool calls), and a uniform API across substrates (training can target anything from function calls to full VMs without rewriting). These infrastructure decisions underpin the agent benchmark scores.
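DSec's Python SDK is not public in this post, so this is purely an illustrative sketch of what "a uniform API across substrates" can look like: one `run()` signature, interchangeable backends, and rollout code that never sees which substrate it is on. All class and method names are invented:

```python
# Invented interface, not the real DSec SDK: the point is that rollout code
# depends only on Sandbox.run(), so swapping substrates needs no rewriting.

from dataclasses import dataclass

@dataclass
class Result:
    stdout: str
    exit_code: int

class Sandbox:
    """Common interface; each substrate changes only how execution happens."""
    def run(self, command: str) -> Result:
        raise NotImplementedError

class FunctionCallSandbox(Sandbox):
    def run(self, command: str) -> Result:
        # cheapest substrate: in-process evaluation (toy stand-in)
        return Result(stdout=f"fn:{command}", exit_code=0)

class MicroVMSandbox(Sandbox):
    def run(self, command: str) -> Result:
        # a real implementation would talk to a Firecracker microVM here
        return Result(stdout=f"vm:{command}", exit_code=0)

def rollout(sandbox: Sandbox, commands):
    """RL rollout loop: substrate-agnostic, sees only the Sandbox interface."""
    return [sandbox.run(c) for c in commands]

for sb in (FunctionCallSandbox(), MicroVMSandbox()):
    print([r.stdout for r in rollout(sb, ["ls", "pytest"])])
```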

Agent benchmark results

Knowledge and reasoning numbers are competitive but not at the top. The agent numbers are where V4-Pro-Max separates from the field.

Table 6: DeepSeek-V4-Pro-Max benchmark comparison against frontier models.

Specific numbers in the agent section of Table 6:

  • Terminal Bench 2.0: 67.9, ahead of GLM-5.1 (63.5) and K2.6 (66.7), behind GPT-5.4-xHigh (75.1) and Gemini-3.1-Pro (68.5).
  • SWE-bench Verified: 80.6 resolved, within points of Opus-4.6-Max (80.8) and Gemini-3.1-Pro (80.6).
  • MCPAtlas Public: 73.6, second only to Opus-4.6-Max (73.8).
  • Turathlon: 51.8, ahead of K2.6 (50.0), GLM-5.1 (40.7), and Gemini-3.1-Pro (48.8).

On the paper's internal R&D coding benchmark, V4-Pro-Max reached a 67% pass rate on 30 selected tasks across PyTorch, CUDA, Rust, and C++, versus 47% for Sonnet 4.5 and 70% for Opus 4.5. In a survey of 85 DeepSeek developers using V4-Pro as their daily driver, 52% said they were ready to replace their current primary coding model with V4-Pro, with another 39% leaning towards yes.

Figure 9 shows the long-context retrieval numbers: MRCR 8-needle accuracy stays above 0.82 up to 256K tokens and sits at 0.59 at 1M.

Figure 9: MRCR 8-needle retrieval over context lengths up to 1M tokens. V4-Pro-Max stays above 0.82 up to 256K and at 0.59 at 1M.

Using the model

The Hub has four checkpoints. The instruct models use FP4 for the MoE expert weights and FP8 for everything else; the base models are FP8 throughout.

Both instruct models support three inference modes: No-Thinking (fast, no chain of thought), Thinking (explicit reasoning inside a think block), and Think-Max (maximum reasoning effort with a dedicated system prompt). Think-Max requires a context window of at least 384K tokens. The recommended sampling parameters for all modes are temperature = 1.0 and top_p = 1.0.
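For serving, the mode constraints above are easy to encode as a request builder. The request shape here targets a generic OpenAI-compatible endpoint and the `thinking_mode` field is an invented placeholder; only the sampling values and the 384K floor come from the post:

```python
# Sketch of the sampling setup for the three modes. The request dict shape
# and "thinking_mode" key are assumptions; temperature/top_p and the 384K
# context floor for Think-Max follow the post.

MIN_CONTEXT = {"no_thinking": 0, "thinking": 0, "think_max": 384_000}

def build_request(prompt, mode="thinking", context_window=1_000_000):
    if context_window < MIN_CONTEXT[mode]:
        raise ValueError(f"{mode} needs >= {MIN_CONTEXT[mode]} tokens of context")
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,   # recommended for all modes
        "top_p": 1.0,         # recommended for all modes
        "extra_body": {"thinking_mode": mode},  # hypothetical mode switch
    }

req = build_request("Summarize the diff", mode="think_max")
print(req["temperature"], req["top_p"])   # 1.0 1.0
```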

V4-Pro's numbers on SWE-bench Verified, MCPAtlas, and the internal R&D benchmark are comparable to frontier closed models on agent tasks. The open questions are how community tool harnesses will adapt to the DSML schema, and whether interleaved thinking transfers to agent frameworks outside its training domain.

The diagrams in this blog post are from the DeepSeek_V4.pdf technical report.
