
A lightweight mathematical reasoning agent using Smolagent

By versatileai | December 9, 2025

By Intel AI Software Group

DeepMath is a mathematical reasoning agent built on Qwen3-4B-Thinking and fine-tuned with GRPO (Group Relative Policy Optimization). Instead of verbose text, the model emits short Python snippets for intermediate steps, executes them in a secure sandbox, and feeds the results back into the reasoning trace, reducing errors and output length. The agent is implemented with the smolagents library.

We evaluate DeepMath on four mathematical datasets: MATH500, AIME, HMMT, and HLE, and show that:

🤖 The math agent alone reduces output length by up to 66% and often improves accuracy.

⚡ GRPO training further improves agent performance on almost all benchmarks.

👉 Code and evaluation script: https://github.com/IntelLabs/DeepMath
👉 Model: https://huggingface.co/Intel/deepmath-v1

Why DeepMath?

Although large language models (LLMs) have advanced reasoning capabilities, mathematical problem solving remains challenging: chain-of-thought traces grow long and become prone to arithmetic errors. Recent studies (^1)(^2) have demonstrated that even small models can deliver strong performance, and other work (^3) has investigated tool use to improve reliability. What these papers generally do not emphasize is reducing trace redundancy, or explicitly training models to favor short, computation-oriented traces that run in a constrained, auditable environment.

We focused on two goals:

Offload deterministic computations to a secure executor.

Train the model to favor concise, computation-oriented traces over verbose text.

DeepMath tackles this by combining a small Python executor with a fine-tuned LLM to enable concise, computation-driven reasoning. The model learns to generate short Python snippets; these snippets are run in the sandbox and their outputs reintegrated into the context. GRPO fine-tuning reinforces this behavior by rewarding correctness and encouraging shorter outputs.
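The generate–execute–reinsert loop described above can be sketched as follows. This is a minimal illustration only: plain `exec` with an allowlisted builtins dict stands in for DeepMath's real sandbox, and `run_snippet` plus the fenced-snippet convention are assumptions, not the project's actual interface.

```python
import re

def run_snippet(snippet: str) -> str:
    """Hypothetical stand-in for the sandboxed executor: runs the snippet
    with an allowlist of builtins and returns the `result` variable as text.
    (The real sandbox also enforces import allowlists and timeouts.)"""
    namespace: dict = {}
    allowed_builtins = {"sum": sum, "range": range, "len": len, "abs": abs}
    exec(snippet, {"__builtins__": allowed_builtins}, namespace)
    return str(namespace.get("result", ""))

def agent_step(model_output: str) -> str:
    """Scan one model turn for ```python fenced snippets, execute each,
    and splice the executor output back into the trace as added context."""
    def _replace(match: re.Match) -> str:
        code = match.group(1)
        return match.group(0) + f"\n[executor output: {run_snippet(code)}]"
    return re.sub(r"```python\n(.*?)```", _replace, model_output, flags=re.DOTALL)

trace = "Compute the sum first.\n```python\nresult = sum(range(1, 101))\n```\nThen continue."
print(agent_step(trace))
```

The key property is that the executor's output becomes part of the model's context for subsequent tokens, so later reasoning can build on exact values instead of re-deriving them in text.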

Architecture

Base model: Qwen3-4B-Thinking.

Executor constraints: sandboxed environment, allowlist of importable modules, per-snippet timeouts.

Reasoning: a mathematical agent built on smolagents, with vLLM as the inference engine.

Training: based on TRL's GRPO trainer; we modified TRL's vLLM client and server to generate GRPO completions using the DeepMath agent.

Figure 1: The vLLM client and server in the TRL library were modified to use the DeepMath agent for candidate generation while still using the vLLM backend.

Agent interface: During inference, the model can emit either ordinary tokens or special agent calls containing Python snippets.

Execution: The snippet runs in a sandboxed environment with strict safety constraints (no file I/O, no networking, enforced timeouts).

Design goal:

Simplicity: Replace multi-line text calculations with short, focused snippets.

Determinism and safety: Enforce strict execution limits.

Interpretability: Snippets are readable and auditable.
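One way to enforce an import allowlist before execution is a static AST pass over each snippet. The sketch below is illustrative only: the module allowlist and the `check_snippet` function are hypothetical, not taken from the DeepMath codebase.

```python
import ast

# Assumed allowlist of importable modules; the real agent's list may differ.
ALLOWED_MODULES = {"math", "fractions", "itertools", "sympy"}

def check_snippet(code: str) -> None:
    """Static gate run before execution: reject imports outside the
    allowlist and any access to double-underscore attributes."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            else:
                names = [node.module]
            for name in names:
                if (name or "").split(".")[0] not in ALLOWED_MODULES:
                    raise PermissionError(f"import of {name!r} not allowed")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise PermissionError(f"dunder attribute {node.attr!r} blocked")

check_snippet("import math\nx = math.factorial(10)")  # passes silently
try:
    check_snippet("import os\nos.remove('x')")
except PermissionError as e:
    print("blocked:", e)
```

A static check like this complements, rather than replaces, runtime sandboxing: it cheaply rejects obviously unsafe snippets before they ever reach the executor.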

Figure 2: Example output in which Python code is generated, executed, and the result inserted back into the trace for use as context.

Training with GRPO

We fine-tune the model using GRPO, a reward-based optimization that balances:

Accuracy reward: +1 for correct answer.

Code use: +1 for generating a code snippet, with the accuracy reward weighted 10:1 over the code-use reward.

Brevity: encouraged by capping GRPO completion candidates at 5,000 tokens.
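One plausible reading of the reward scheme above is the sketch below. The normalization to [0, 1], the zero reward for over-length completions, and the function signature are all assumptions made for illustration.

```python
MAX_TOKENS = 5000  # completion cap used during GRPO candidate generation

def grpo_reward(answer: str, gold: str, used_code: bool, num_tokens: int) -> float:
    """Composite reward: +1 for a correct answer, +1 for emitting a code
    snippet, with correctness weighted 10:1 over code use, normalized to [0, 1].
    Over-length candidates are truncated and so never yield a final answer."""
    if num_tokens > MAX_TOKENS:
        return 0.0
    accuracy = 1.0 if answer.strip() == gold.strip() else 0.0
    code_bonus = 1.0 if used_code else 0.0
    return (10.0 * accuracy + 1.0 * code_bonus) / 11.0
```

Because GRPO ranks completions within a group, only the relative ordering of rewards matters; the hard cap at 5,000 tokens implicitly penalizes verbosity without needing an explicit length term.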

Temperature scheduling: To balance exploration and stability during training, we implemented linear temperature scheduling (T=1.2 → T=0.7), encouraging exploration early in training and lowering the temperature as the policy stabilizes.
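The linear schedule can be written as a simple interpolation (the function name and step convention are assumptions for illustration):

```python
def temperature(step: int, total_steps: int,
                t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linearly interpolate sampling temperature from t_start at step 0
    to t_end at the final step, clamped to [t_end, t_start]."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return t_start + (t_end - t_start) * frac
```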

In-context learning: The prompt contains four solved examples with agent calls and executor outputs, from which the model learns the syntax and the call/response pattern.

Dataset: We used the Tool-Integrated Reasoning (TIR) subset of the OpenMathReasoning dataset. Note that GRPO uses only the problems, not the solutions. This subset was chosen because its problems benefit from external tools.

Evaluation

We benchmarked DeepMath against a baseline on four datasets. Metrics include:

Majority@16: majority vote over 16 samples, a cross-sample robustness metric used in prior mathematical reasoning studies (see references).

Average output length: measures brevity.
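Majority@k is computed by majority vote over sampled answers; a minimal sketch (tie-breaking by first occurrence here is an implementation choice, not necessarily the paper's):

```python
from collections import Counter

def majority_at_k(samples: list[str], gold: str) -> float:
    """Majority@k: 1.0 if the most frequent sampled answer matches the
    gold answer, else 0.0. With k=16 samples this is Majority@16."""
    most_common_answer, _ = Counter(s.strip() for s in samples).most_common(1)[0]
    return 1.0 if most_common_answer == gold.strip() else 0.0
```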

Table 1: Main results.

We compare the baseline configuration (Qwen3-4B-Thinking-2507, no agent) with the DeepMath model. As ablations, we evaluate the agent framework with an untrained Qwen3 model (denoted +Agent), and we examine whether GRPO training (for agent use) improves non-agent inference (denoted +GRPO). The two ablations are therefore independent, not additive.

We can see that agent inference reduces output length, with variable effects on accuracy. The DeepMath model, trained with GRPO and run in agent mode, shows the highest accuracy with shortened traces. We conclude that both GRPO training and agent inference are required for the best results.

Key Insight: DeepMath reduces output length by up to 66% while improving accuracy on difficult datasets.

Why it matters

Accuracy: Offloading calculations reduces arithmetic errors.

Efficiency: Shorter outputs make inference faster and easier to interpret.

Safety: Sandbox execution reduces the risk of executing arbitrary code.

Conclusion

DeepMath combines a small Python executor with an LLM, presenting a practical, lightweight way to train models to favor short, computation-driven traces. Offloading deterministic computation reduces arithmetic and numerical errors and shortens traces, and GRPO fine-tuning further promotes concise, correct answers. The result is a math-solving agent that is more accurate and easier to interpret, without requiring large models or heavyweight external tools.

Try it yourself

Check out our GitHub repository and share your feedback. Contributions are welcome. 🚀

Citation

If you use DeepMath in your research, please cite:

@software{deepmath2025,
  author    = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title     = {DeepMath: A Lightweight Mathematics Reasoning Agent for LLM},
  year      = {2025},
  publisher = {Intel AI Labs},
  url       = {https://github.com/IntelLabs/DeepMath}
}

Limitations and future work

Scope: Focused on small models and mathematical reasoning.

Generalization: Evaluated on contest-style mathematics; the results may not transfer to open-ended mathematical creativity or formal proofs.

Safety: Executing generated code is inherently risky. Although DeepMath uses strict sandboxing and resource limits, any deployment requires careful attack-surface management and rate limiting.

References

(1) Luo, Michael, Sijun Tan, Justin Wong, et al. 2025. "DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL." https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2

(2) Liu, Mingjie, Shizhe Diao, Ximing Lu, et al. 2025. "ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models." arXiv:2505.24864. Preprint, arXiv, May 30. https://doi.org/10.48550/arXiv.2505.24864

(3) Moshkov, Ivan, Darragh Hanley, Ivan Sorokin, et al. 2025. "AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with the OpenMathReasoning Dataset." arXiv:2504.16891. Preprint, arXiv, April 23. https://doi.org/10.48550/arXiv.2504.16891
