Today we share research that bridges two powerful paradigms of AI agent design: the expressiveness of code-based actions and the reliability of structured generation. Our findings show that forcing code agents to generate both their thoughts and their code in a structured JSON format can significantly outperform traditional approaches across multiple benchmarks.
Figure 1: Accuracy comparison of three approaches, Structured CodeAgent (blue), CodeAgent (orange), and ToolCallingAgent (grey), on SmolBench (GAIA, MATH, SimpleQA, and Frames). Error bars represent 95% confidence intervals.
The Evolution of Agent Actions
AI agents need to take actions in the world, whether that means calling APIs, processing data, or reasoning through complex problems. The way agents express these actions has evolved through several paradigms.
Traditional JSON Agent: the agent generates structured JSON to call tools.
{"tool": "get_weather", "arguments": {"city": "Paris"}}
These agents work by selecting from a list of predefined tools and generating tool calls in JSON format. This style of tool calling was popularized by OpenAI's function-calling API and has since become the most widely used way to invoke tools.
It’s reliable, but limited.
A limited set of actions: the actions an agent can take are restricted to the predefined tools, which limits functionality.
Lack of composability: if a task needs to combine information from multiple sources, the JSON agent struggles because it has no way to hold intermediate state across tool calls. Some models support parallel tool calls, but that does not easily cover cases where the output of one tool determines the next action, or where results must be compared and processed (see the sketch after this list).
Rigid structure: cases that do not map exactly onto what the tools expect are hard to handle.
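To make the composability issue concrete, here is a simplified sketch of the outer loop a JSON tool-calling agent needs. The function and variable names (call_model, run_tool) are illustrative, not from any specific framework: every tool call costs a full model round-trip, and intermediate values only exist as text in the message history.

# Simplified sketch of a JSON tool-calling loop (illustrative names, no specific framework).
messages = [{"role": "user", "content": "What is the average temperature in Paris, Tokyo and New York?"}]

while True:
    reply = call_model(messages)                      # one full LLM round-trip per action
    if reply["type"] == "final_answer":
        break
    tool_call = reply["tool_call"]                    # e.g. {"tool": "get_weather", "arguments": {"city": "Paris"}}
    result = run_tool(tool_call["tool"], tool_call["arguments"])
    # The result goes back as plain text; the model has no variables in which to accumulate a sum.
    messages.append({"role": "tool", "content": str(result)})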
Code Agent: the agent uses the LLM's innate coding ability to write executable Python code directly.
temperature_sum = 0
for city in ["Paris", "Tokyo", "New York"]:
    temp = get_weather(city)
    temperature_sum += temp

print(f"Average temperature: {temperature_sum / 3:.1f}°C")
This shift, first presented in the paper “Executable Code Actions Elicit a Better LLM Agent”, gave AI agents the flexibility to write arbitrary executable Python code in addition to tool calls.
The key insight is that tools are called directly from within the code, which makes variables and state management natural. Agents can call tools inside loops, functions, and conditionals, essentially generating a dynamic graph of tool executions with each action!
Benefits of the CodeAgent approach:
Dynamic tool use: agents decide which tools to call based on the current context.
Unlimited flexibility: any Python construct can be used to achieve the goal.
Ability to test hypotheses: agents can form and test hypotheses, making their behavior more adaptive.
However, parsing code out of Markdown can be error-prone, which raises a question: what if we used structured generation to produce code actions?
Adding Structured Outputs to the Code Agent
Structured outputs let us force the LLM to generate an explicit thought along with its code as a JSON blob:
{
  "thought": "I need to find the average temperature across three cities.",
  "code": "temperature_sum = 0\nfor city in [\"Paris\", \"Tokyo\", \"New York\"]:\n    temp = get_weather(city)\n    temperature_sum += temp\n\nprint(f\"Average temperature: {temperature_sum / 3:.1f}°C\")"
}
The important difference is that the generation is constrained: the model is not just encouraged to output its thoughts, it is forced by structured outputs to respect this structure.
This approach gets the best of both worlds, adding the reliability of structured generation to the flexibility of code execution:
Explicit reasoning: the thought field forces the agent to reason before acting.
Reliable parsing: the JSON structure eliminates Markdown parsing errors.
Full code expressiveness: the code field keeps all the flexibility of code agents.
Better separation: planning and execution are clearly separated.
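Concretely, the constraint can be expressed as a JSON Schema that the inference backend enforces during decoding. Below is a minimal sketch of such a schema and how it might be passed to an OpenAI-compatible client; the exact schema smolagents uses internally may differ.

# Minimal sketch: a schema that forces a "thought" string and a "code" string.
# Illustrative only; not necessarily the exact schema smolagents uses.
CODE_ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "thought": {"type": "string"},
        "code": {"type": "string"},
    },
    "required": ["thought", "code"],
    "additionalProperties": False,
}

# With an OpenAI-compatible API, the schema can be enforced at decoding time
# through the response_format parameter, so malformed outputs cannot occur.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "code_action", "schema": CODE_ACTION_SCHEMA, "strict": True},
}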
🧪 Benchmark Results
We compared these three paradigms across multiple benchmarks: GAIA, MATH, SimpleQA, and Frames. The results show a clear pattern: code actions plus structured generation consistently improve performance for capable models.
For the most capable models, the structured approach consistently outperformed the regular CodeAgent approach, by 2-7 percentage points on average.
OpenAI models: show the largest improvements with structure, especially on reasoning-heavy tasks.
Claude models: also benefit from structure, with Claude 3.7 Sonnet showing particularly strong results.
Why Structure (Generally) Helps
The parsing problem is real
The CodeAgent implementation in smolagents extracts Python code from the LLM's Markdown output.
This extraction fails when the Markdown is malformed: multiple code blocks in a single response, incomplete blocks, or incorrectly formatted fences.
Structured generation eliminates these problems with reliable JSON parsing.
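As a rough illustration of the difference, here is a simplified sketch of the two parsing paths (this is not the actual smolagents parser):

import json
import re

def extract_code_from_markdown(text: str) -> str:
    # Fragile: assumes the model emitted exactly one well-formed ```python fence.
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    if match is None:
        raise ValueError("no code block found")  # a common first-step failure mode
    return match.group(1)

def extract_code_from_json(text: str) -> str:
    # With structured outputs, the response is guaranteed to be valid JSON
    # containing a "code" field, so a single json.loads call is enough.
    return json.loads(text)["code"]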
To understand why structured generation matters, we analyzed 15,724 agent traces across our benchmarks. The results are striking:
2.4% of traces had a parsing error on the first call.
Traces with a first-call parsing error: 42.3% success rate.
Traces without a first-call parsing error: 51.3% success rate.
Agent traces without parsing errors succeed 21.3% more often (in relative terms) than traces with parsing errors.
This is not just a matter of convenience: parsing errors create a cascade of failures that significantly hurts overall agent performance. If an agent cannot execute its first action because of malformed code, it often struggles to recover and ends up on a suboptimal problem-solving path.
Figure 2: A parsing error at the first step reduces the success rate by 21.3% and increases the average number of steps taken from 3.18 to 4.63.
Bonus: the benefits of forced reasoning
With structured generation, explicit thinking is not just encouraged but enforced: agents must articulate their reasoning before acting. This brings:
Better planning: agents think through the problem more systematically.
Improved reliability: explicit reasoning catches logic errors early.
The structure tax
Our results also reveal a clear capability threshold: a model needs sufficient instruction-following ability and enough JSON coverage in its pre-training data to benefit from structured generation. This suggests the structured approach works best with:
Large, well-trained models.
Models with strong instruction-following abilities.
Models fine-tuned on structured generation.
When structure breaks: a real example
When a small model (mistralai/Mistral-7B-Instruct-v0.3) tries to generate structured code, the cognitive load becomes too high. Here is what happens:
{
  "thought": "I need to find the height...",
  "code": "web_search(query=\"eiffel tower height\")\", "
}
This model produces syntactically invalid Python: web_search(query="eiffel tower height")", with a stray quote and comma appended to the call.
This illustrates a "structure tax": small models struggle to simultaneously handle JSON formatting, Python syntax, and the actual problem-solving logic. The cognitive overhead of structured generation can overwhelm a model that would perform reasonably well with simpler Markdown-based code generation.
When to Use Structured CodeAgents
✅ Use it when:
You are using a capable model (32B+ parameters, or a frontier model).
The task requires complex reasoning combined with code execution.
You need reliable parsing of agent outputs.
⚠️ Consider alternatives when:
You are using smaller models that struggle with structured generation.
A simple, predefined workflow is sufficient.
How to use it with smolagents
It's very simple! Just enable it with use_structured_outputs_internally:
from smolagents import CodeAgent, GoogleSearchTool, InferenceClientModel

agent = CodeAgent(
    tools=[GoogleSearchTool(provider="serper")],
    model=InferenceClientModel("Qwen/Qwen3-235B-A22B", provider="nebius"),
    use_structured_outputs_internally=True,
)

result = agent.run("Calculate how long a cheetah would take to run across the Golden Gate Bridge.")
The LLM then generates something like this:
{
  "thought": "I need to find the length of the Golden Gate Bridge and the top speed of a cheetah, then calculate the time.",
  "code": "bridge_info = web_search('Golden Gate Bridge length meters')\ncheetah_speed = web_search('cheetah top speed')\n..."
}
The "code" part is then executed by the agent as usual. It's still a standard CodeAgent, but parsing is now 100% reliable!
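Conceptually, the agent's inner step then looks something like the sketch below. This is a simplification, not the actual smolagents implementation, and python_executor stands in for whatever sandboxed executor you use:

import json

def run_one_step(llm_output: str, python_executor) -> str:
    # With structured outputs, llm_output is guaranteed to be valid JSON
    # with "thought" and "code" fields, so no Markdown parsing is needed.
    action = json.loads(llm_output)
    print("Thought:", action["thought"])      # the explicit reasoning, useful for logging
    return python_executor(action["code"])    # the code runs exactly as in a regular CodeAgent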
Implementation Tips
Clear prompting: make sure your prompt clearly specifies the JSON structure you expect.
Model selection: choose a model with strong structured generation capabilities.
Provider selection: if you are using inference providers on Hugging Face, support for structured generation varies from provider to provider, so check the list of providers that support it.
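For example, a system prompt fragment along these lines (the wording is illustrative, not the exact smolagents prompt) makes the expected structure explicit:

Respond with a single JSON object containing exactly two keys:
"thought": a short explanation of what you plan to do next, and
"code": the Python code to execute, as a plain string without Markdown fences.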
The Big Picture – What’s next?
This study suggests we are moving toward a more nuanced understanding of agent architectures. The question is no longer just "What can the agent do?" but "How does the agent think about what it is doing?"
Perhaps making the reasoning process explicit helps the model stay on track. Or perhaps it simply makes the output easier to parse. Either way, it's a win.
But this is just the beginning. There are so many questions left to explore:
What other structural improvements could be useful?
How does this approach transfer across model architectures, especially smaller models?
What does this tell us about the nature of AI reasoning?
For now, if you're using smolagents (or building your own CodeAgent system), consider trying structured outputs. Your parsing errors will thank you, and you may well see a nice boost in performance!