Versa AI hub
A lightweight mathematical reasoning agent using Smolagent

By versatileai | December 9, 2025 | 6 min read

By Intel AI Software Group

DeepMath is a fine-tuned mathematical reasoning agent built on Qwen3-4B Thinking and trained with GRPO (Group Relative Policy Optimization). Instead of producing redundant text, the model emits short Python snippets for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length. The agent is implemented using the smolagents library.

We evaluate DeepMath on four mathematical datasets: MATH500, AIME, HMMT, and HLE, and show that:

🤖 The math agent alone reduces output length by up to 66% and often improves accuracy.

⚡ GRPO training further improves agent performance on almost all benchmarks.

👉 Code and evaluation script: https://github.com/IntelLabs/DeepMath
👉 Model: https://huggingface.co/Intel/deepmath-v1

Why DeepMath?

Although large language models (LLMs) have advanced reasoning capabilities, mathematical problem solving remains challenging: chain-of-thought traces grow long and are prone to arithmetic errors. Recent studies (1)(2) have demonstrated that even small models can deliver strong performance, and other work (3) has investigated the use of tools to improve reliability. What these papers generally do not emphasize is reducing trace redundancy, or explicitly training models to favor short, computation-oriented traces that run in a constrained and auditable environment.

We focused on two goals:

Offload deterministic computations to a secure executor.

Train the model to favor concise, computation-oriented traces over redundant text.

DeepMath tackles this by combining a small Python executor with a fine-tuned LLM to enable concise, computation-driven reasoning. The model learns to generate short Python snippets; these snippets are run in a sandbox and their results are reintegrated into the context. GRPO fine-tuning encourages this behavior by rewarding correctness and shorter outputs.
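The execute-and-reintegrate step can be sketched in plain Python. This is a minimal illustration, not the DeepMath implementation: `ALLOWED_MODULES` and `run_snippet` are hypothetical names, and the real executor also enforces per-snippet timeouts and stronger isolation.

```python
import contextlib
import io

# Hypothetical allowlist; the real agent restricts importable modules similarly.
ALLOWED_MODULES = {"math", "fractions", "itertools"}

def run_snippet(code: str) -> str:
    """Execute a model-generated snippet under an import allowlist and
    return its captured stdout, to be spliced back into the trace."""
    real_import = __import__

    def safe_import(name, *args, **kwargs):
        if name.split(".")[0] not in ALLOWED_MODULES:
            raise ImportError(f"module {name!r} is not in the allowlist")
        return real_import(name, *args, **kwargs)

    # Minimal builtins: enough for short arithmetic snippets, nothing else.
    env = {"__builtins__": {"print": print, "range": range, "len": len,
                            "sum": sum, "__import__": safe_import}}
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)
    return buf.getvalue().strip()

# A snippet the model might emit for an intermediate step:
result = run_snippet("import math\nprint(math.comb(10, 3))")  # → "120"
```

A disallowed import (e.g. `os`) raises `ImportError` instead of executing, which keeps failures visible and auditable.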

Architecture

Base model: Qwen3-4B Thinking.

Executor constraints: sandboxed environment, an allowlist of importable modules, and per-snippet timeouts.

Reasoning: a mathematical agent built on smolagents, with vLLM as the inference engine.

Training: based on TRL’s GRPO trainer; we modified TRL’s vLLM client and server to generate GRPO completions with the DeepMath agent.

Figure 1: The vLLM client and server have been modified to use the DeepMath agent for candidate generation while still using the vLLM backend.

Agent interface: During inference, the model’s output interleaves regular tokens with special agent calls that contain Python snippets.

Execution: Each snippet runs in a sandboxed environment with strict safety constraints (no file I/O, no networking, per-snippet timeouts).

Design goals:

Simplicity: Replace multi-line text calculations with short, focused snippets.

Determinism and safety: Enforce strict execution limits.

Interpretability: Snippets are readable and auditable.

Figure 2: Example output where Python code is generated, evaluated, and the answer is inserted into the trace and used for context.
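The splicing behavior shown in Figure 2 can be illustrated with a small helper. The fenced-block call format and the `[executor output: …]` marker here are assumptions for illustration; the actual agent uses smolagents’ own call syntax.

```python
import re

# Matches fenced python blocks; the real agent-call format differs.
SNIPPET_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)

def splice_results(trace: str, run) -> str:
    """Append each snippet's executor output right after the snippet,
    so later reasoning can condition on the computed value.
    `run` is any function mapping a code string to its output string."""
    def repl(match):
        output = run(match.group(1))
        return match.group(0) + f"\n[executor output: {output}]"
    return SNIPPET_RE.sub(repl, trace)
```

With a trace containing one snippet, the executor’s answer lands directly after the code block, keeping the full trace readable and auditable.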

Training with GRPO

We fine-tune the model using GRPO, a reward-based optimization that balances:

Accuracy reward: +1 for a correct answer.

Code-use reward: +1 for generating a code snippet, with a 10:1 weighting in favor of the accuracy reward.

Length reduction: encouraged by capping GRPO completion candidates at 5,000 tokens.

Temperature scheduling: To balance exploration and stability during training, we use a linear temperature schedule (T = 1.2 → T = 0.7), encouraging exploration early in training and lowering the temperature as proficiency increases.

In-context learning: The prompt contains four solved examples with agent calls and executor outputs, from which the model learns the syntax and call/response pattern.

Dataset: We used the Tool-Integrated Reasoning (TIR) subset of the OpenMathReasoning dataset; note that GRPO uses only the problems, not the solutions. This subset was chosen to ensure that the problems benefit from external tools.
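The reward shaping and temperature schedule above can be sketched as follows. The function names and exact weights are illustrative, not the trainer’s actual code.

```python
def grpo_reward(answer: str, gold: str, used_code: bool) -> float:
    """Accuracy reward (+1 for a correct answer) plus a code-use bonus
    weighted 1/10 as much, reflecting the 10:1 weighting described above."""
    accuracy = 1.0 if answer.strip() == gold.strip() else 0.0
    return accuracy + (0.1 if used_code else 0.0)

def temperature(step: int, total_steps: int,
                t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linear schedule from t_start down to t_end: more exploration
    early in training, more stability late."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return t_start + frac * (t_end - t_start)
```

A correct answer with a code snippet scores 1.1; a wrong answer that still used code scores 0.1, so correctness always dominates the incentive.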

Evaluation

We benchmarked DeepMath against a baseline on the four datasets. Metrics include:

Majority@16: majority vote over 16 samples, a cross-sample robustness metric used in previous mathematical reasoning studies.

Average output length: a measure of brevity.
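Majority voting over samples is a generic technique; a minimal sketch (not the evaluation script itself) looks like this:

```python
from collections import Counter

def majority_at_k(samples: list[str]) -> str:
    """Return the most frequent answer among k sampled completions
    (k = 16 in the evaluation above)."""
    return Counter(samples).most_common(1)[0][0]

# e.g. three of five samples agree on "3":
winner = majority_at_k(["3", "5", "3", "7", "3"])  # → "3"
```

Aggregating over samples rewards answers the model reaches consistently, which is why it is a common robustness metric for mathematical reasoning.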

Table: Main results.

We compare the baseline configuration (Qwen3-4B-Thinking-2507, no agent) with the DeepMath model. As an ablation, we evaluate the agent framework alone by running the untrained Qwen3 model inside it (denoted +Agent). We also examine whether GRPO training (for agent use) improves non-agent inference (denoted +GRPO). The two ablations are therefore independent and not additive.

We can see that agent inference alone reduces output length, with mixed effects on accuracy. The DeepMath model, trained with GRPO and run in agent mode, shows the highest accuracy with the shortest traces. We conclude that both GRPO training and agent inference are required for the best results.

Key Insight: DeepMath reduces output length by up to 66% while improving accuracy on difficult datasets.

Why it matters

Accuracy: Offloading calculations reduces arithmetic errors.

Efficiency: Shorter outputs make inference faster and easier to interpret.

Safety: Sandbox execution reduces the risk of executing arbitrary code.

Conclusion

DeepMath combines a small Python executor with an LLM, presenting a practical, lightweight way to train models to favor short, computation-driven traces. Offloading deterministic computations reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further promotes concise, correct answers. The result is a mathematical reasoning agent that is more accurate and easier to interpret, without requiring large models or heavyweight external tools.

Try it yourself

Check out our GitHub repository and share your feedback. Contributions are welcome. 🚀

Citation

If you use DeepMath in your research, please cite:

@software{deepmath2025,
  author    = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title     = {DeepMath: A Lightweight Mathematics Reasoning Agent for LLM},
  year      = {2025},
  publisher = {Intel AI Labs},
  url       = {https://github.com/IntelLabs/DeepMath}
}

Limitations and future work

Scope: Focused on small models and mathematical reasoning.

Generalization: Evaluated on contest-style mathematics; the results may not transfer to open-ended mathematical creativity or formal proofs.

Safety: Executing generated code is inherently risky. Although DeepMath uses strict sandboxing and resource limits, any deployment requires careful attack-surface management and rate limiting.

References

(1) Luo, Michael, Sijun Tan, Justin Wong, et al. 2025. “DeepScaleR: Scaling RL to exceed O1 preview on 1.5B models.” https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

(2) Liu, Mingjie, Shizhe Diao, Ximing Lu, et al. 2025. “ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models.” arXiv:2505.24864. Preprint, arXiv, May 30. https://doi.org/10.48550/arXiv.2505.24864

(3) Moshkov, Ivan, Darragh Hanley, Ivan Sorokin, et al. 2025. “AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with the OpenMathReasoning Dataset.” arXiv:2504.16891. Preprint, arXiv, April 23. https://doi.org/10.48550/arXiv.2504.16891
