By Intel AI Software Group
DeepMath is a mathematical reasoning agent built on Qwen3-4B-Thinking and fine-tuned with GRPO (Group Relative Policy Optimization). Instead of verbose text, the model emits small Python snippets for intermediate steps; the snippets run in a secure sandbox and their results are fed back into the reasoning trace, reducing errors and output length. The agent is implemented with the smolagents library.
We evaluate DeepMath on four mathematical datasets (MATH500, AIME, HMMT, and HLE) and show that:
🤖 The math agent alone reduces output length by up to 66% and often improves accuracy.
⚡ GRPO training further improves agent performance on almost all benchmarks.
👉 Code and evaluation script: https://github.com/IntelLabs/DeepMath
👉 Model: https://huggingface.co/Intel/deepmath-v1
Why DeepMath?
Although large language models (LLMs) have advanced reasoning capabilities, mathematical problem solving remains challenging: chain-of-thought traces grow long and are prone to arithmetic errors. Recent work (^1)(^2) has shown that even small models can deliver strong performance, and other studies (^3) have explored tool use to improve reliability. What these papers generally do not emphasize is reducing trace redundancy or explicitly training models to favor short, computation-oriented traces that run in a constrained, auditable environment.
We focused on two goals:
Offload deterministic computations to a secure executor.
Train the model to favor concise, computation-oriented traces over redundant text.
DeepMath tackles this by pairing a small Python executor with a fine-tuned LLM to enable concise, computation-driven reasoning. The model learns to generate short Python snippets; these snippets run in the sandbox and their results are reintegrated into the context. GRPO fine-tuning reinforces this behavior by rewarding correctness and encouraging shorter outputs.
Architecture
Base model: Qwen3-4B-Thinking.
Executor constraints: sandboxed environment, an allow-list of importable modules, and per-snippet timeouts.
Agent: a math agent built on the smolagents library, with vLLM as the inference engine.
Training: built on TRL's GRPO trainer; we modified TRL's vLLM client and server so that GRPO completions are generated by the DeepMath agent.
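To make the setup concrete, here is a minimal sketch of how such an agent can be wired together with smolagents and a vLLM-served model. The model id, port, import allow-list, and step limit below are illustrative assumptions, not the exact DeepMath configuration.

```python
# Minimal sketch: a smolagents CodeAgent backed by a vLLM OpenAI-compatible server.
# Model id, port, allow-list, and max_steps are illustrative assumptions.
from smolagents import CodeAgent, OpenAIServerModel

# vLLM serving Qwen3-4B-Thinking with an OpenAI-compatible endpoint, e.g.:
#   vllm serve Qwen/Qwen3-4B-Thinking-2507 --port 8000
model = OpenAIServerModel(
    model_id="Qwen/Qwen3-4B-Thinking-2507",
    api_base="http://localhost:8000/v1",
    api_key="EMPTY",
)

# The CodeAgent executes generated Python snippets locally; only the listed
# modules may be imported, and the interaction is bounded by max_steps.
agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["math", "sympy", "numpy"],
    max_steps=8,
)

answer = agent.run("Compute the sum of the first 100 positive odd integers.")
print(answer)
```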

Figure 1: The vLLM client and server have been modified to use the DeepMath agent for candidate generation while still using the vLLM backend.
Agent interface: During inference, the model can interleave regular tokens with special agent calls that contain Python snippets.
Execution: Each snippet runs in a sandboxed environment with strict safety constraints (no file I/O, no networking, hard per-snippet timeouts); a minimal execution sketch follows the design goals below.
Design goals:
Simplicity: Replace multi-line text calculations with short, focused snippets.
Determinism and safety: Enforce strict execution limits.
Interpretability: Snippets are readable and auditable.
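As an illustration of these constraints (not the actual DeepMath executor), the sketch below runs a generated snippet in a separate process with an import allow-list, a minimal set of builtins, and a hard timeout. The allow-list and the exposed builtins are assumptions chosen for the example.

```python
# Illustrative sandbox sketch only: separate process, import allow-list,
# minimal builtins (no open(), hence no file I/O), and a hard timeout.
import multiprocessing

ALLOWED_MODULES = {"math", "sympy", "fractions"}  # assumption: example allow-list

def _guarded_import(name, *args, **kwargs):
    # Reject any import whose top-level package is not on the allow-list.
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"import of '{name}' is not allowed")
    return __import__(name, *args, **kwargs)

def _run(snippet, queue):
    # Expose only a handful of builtins; file and network APIs are simply absent.
    safe_builtins = {"__import__": _guarded_import, "print": print, "range": range,
                     "len": len, "sum": sum, "min": min, "max": max, "abs": abs}
    namespace = {"__builtins__": safe_builtins}
    exec(snippet, namespace)
    queue.put(namespace.get("result"))

def run_snippet(snippet, timeout_s=5):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_run, args=(snippet, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():          # snippet exceeded its time budget
        proc.terminate()
        return None
    return queue.get() if not queue.empty() else None

if __name__ == "__main__":
    print(run_snippet("import math\nresult = math.comb(10, 3)"))  # -> 120
```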

Figure 2: Example output in which Python code is generated and executed, and the result is inserted back into the trace as additional context.
Training with GRPO
We fine-tune the model using GRPO, a reward-based optimization that balances the following (a minimal reward sketch follows this list):
Accuracy reward: +1 for a correct final answer.
Code-use reward: +1 for generating a code snippet; the accuracy reward is weighted 10:1 relative to this reward.
Length reduction: encouraged by capping GRPO completion candidates at 5,000 tokens.
Temperature scheduling: To balance exploration and stability during training, we use a linear temperature schedule (T = 1.2 → T = 0.7), encouraging exploration early in training and lowering the temperature as proficiency increases.
In-context learning: the context includes four solved examples with agent calls and executor outputs, from which the model learns the syntax and the call/response pattern.
Dataset: We use the Tool-Integrated Reasoning (TIR) subset of the OpenMathReasoning dataset; GRPO uses only the problems, not the reference solutions. This subset was chosen because its problems benefit from external tools.
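For illustration, the sketch below shows how the two reward signals might look as TRL-style reward functions (a list of completions in, a list of floats out). The boxed-answer extraction, the answers argument, and the code-detection heuristic are assumptions; only the +1 rewards, the 10:1 weighting, and the 5,000-token cap come from the description above.

```python
# Sketch of the two reward signals in the style of TRL reward functions.
# Answer extraction and code detection below are illustrative heuristics.
import re

def accuracy_reward(completions, answers, **kwargs):
    """+1 when the final boxed answer matches the reference answer."""
    scores = []
    for completion, answer in zip(completions, answers):
        match = re.search(r"\\boxed\{([^}]*)\}", completion)
        ok = match is not None and match.group(1).strip() == str(answer).strip()
        scores.append(1.0 if ok else 0.0)
    return scores

def code_use_reward(completions, **kwargs):
    """+1 when the trace contains at least one agent code call (heuristic match)."""
    return [1.0 if re.search(r"<code>|Code:", c) else 0.0 for c in completions]

# The combined reward would weight accuracy 10:1 over code use, with completion
# candidates capped at 5,000 tokens, e.g. via reward weights [10.0, 1.0] and a
# 5,000-token completion limit in the GRPO training configuration.
```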
Evaluation
We benchmark DeepMath against a baseline on the four datasets. Metrics include:
Majority@16: majority voting over 16 samples, a cross-sample robustness metric used in prior mathematical reasoning work (see references); a small voting sketch follows this list.
Average output length: a measure of brevity.
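For reference, majority@16 reduces to a simple voting step over the 16 sampled answers. The sketch below assumes answers have already been extracted and normalized; DeepMath's evaluation script may handle ties and normalization differently.

```python
# Illustrative majority@16 voting over pre-extracted answers for one problem.
from collections import Counter

def majority_at_k(answers, reference):
    """answers: the k extracted final answers for one problem (None = no answer)."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return False
    voted, _ = counts.most_common(1)[0]
    return voted == reference

# Example with k = 16 sampled answers
samples = ["42"] * 9 + ["41"] * 5 + [None] * 2
print(majority_at_k(samples, "42"))  # True
```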
We compare the baseline configuration (Qwen3-4B-Thinking-2507, no agent) with the DeepMath model. As an ablation, we evaluate the agent framework with an untrained Qwen3 model, denoted +Agent. We also examine whether GRPO training (for agent use) improves non-agent inference, denoted +GRPO. The two ablations are therefore independent and not additive.
Agent inference consistently reduces output length, while its effect on accuracy varies. The DeepMath model, trained with GRPO and run in agent mode, achieves the highest accuracy with shortened traces. We conclude that both GRPO training and agent inference are needed for the best results.
Key Insight: DeepMath reduces output length by up to 66% while improving accuracy on difficult datasets.
Why It Matters
Accuracy: Offloading calculations reduces arithmetic errors.
Efficiency: Shorter outputs make inference faster and easier to interpret.
Safety: Sandbox execution reduces the risk of executing arbitrary code.
Conclusion
DeepMath combines a small executor with an LLM and offers a practical, lightweight way to train models to favor short, computation-driven traces. Offloading deterministic computations reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further promotes concise, correct answers. The result is a math-solving agent that is more accurate and easier to interpret, without requiring large models or heavyweight external tools.
Try It Yourself
Check out our GitHub repository and share your feedback. Contributions are welcome. 🚀
Citation
If you use DeepMath in your research, please cite:
@software{deepmath2025,
  author    = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title     = {DeepMath: A Lightweight Mathematics Reasoning Agent for LLM},
  year      = {2025},
  publisher = {Intel AI Labs},
  url       = {https://github.com/IntelLabs/DeepMath}
}
Limitations and Future Challenges
Scope: Focused on small models and mathematical reasoning.
Generalization: Assessed on contest-style mathematics; the results may not transfer to open-ended mathematical creativity or formal proofs.
Safety: Executing generated code is inherently risky. Although DeepMath uses strict sandboxing and resource limits, any deployment requires careful attack-surface management and rate limiting.
References
(1) Luo, Michael, Sijun Tan, Justin Wong, et al. 2025. "DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL." https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2
(2) Liu, Mingjie, Shizhe Diao, Ximing Lu, et al. 2025. "ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models." arXiv:2505.24864. Preprint, arXiv, May 30. https://doi.org/10.48550/arXiv.2505.24864
(3) Moshkov, Ivan, Darragh Hanley, Ivan Sorokin, et al. 2025. "AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with the OpenMathReasoning Dataset." arXiv:2504.16891. Preprint, arXiv, April 23. https://doi.org/10.48550/arXiv.2504.16891

