Today we share research that bridges two powerful paradigms of AI agent design: the expressiveness of code-based actions and the reliability of structured generation. Our findings show that forcing code agents to generate both their thoughts and their code in a structured JSON format can significantly outperform traditional approaches across multiple benchmarks.
Figure 1: Accuracy comparison of three approaches, Structured CodeAgent (blue), CodeAgent (orange), and ToolCallingAgent (grey), on SmolBench (GAIA, MATH, SimpleQA, and Frames). Error bars represent 95% confidence intervals.
The Evolution of Agent Actions
AI agents need to take actions in the world, whether that means calling APIs, processing data, or reasoning through complex problems. The way agents express these actions has evolved through several paradigms.
Traditional JSON Agent: the agent generates structured JSON to call tools.
{"tool": "get_weather", "arguments": {"city": "Paris"}}
These agents work by selecting from a list of predefined tools and generating tool calls in JSON format. This style of tool calling was popularized by OpenAI's function-calling API and has since become the most widely used way to invoke tools.
It’s reliable, but limited.
A limited set of actions: the actions an agent can take are restricted to the predefined tools, which limits functionality.
Lack of composability: if a task needs to combine information from multiple sources, the JSON agent struggles because it has no way to hold intermediate state across tool calls. Some models support parallel tool calls, but that does not easily cover cases where the output of one tool determines the next action, or where results must be compared and processed (see the sketch after this list).
Rigid structure: cases that do not map exactly onto what the tools expect are hard to handle.
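To make the composability issue concrete, here is a simplified sketch of the outer loop a JSON tool-calling agent needs. The function and variable names (call_model, run_tool) are illustrative, not from any specific framework: every tool call costs a full model round-trip, and intermediate values only exist as text in the message history.

# Simplified sketch of a JSON tool-calling loop (illustrative names, no specific framework).
messages = [{"role": "user", "content": "What is the average temperature in Paris, Tokyo and New York?"}]

while True:
    reply = call_model(messages)                      # one full LLM round-trip per action
    if reply["type"] == "final_answer":
        break
    tool_call = reply["tool_call"]                    # e.g. {"tool": "get_weather", "arguments": {"city": "Paris"}}
    result = run_tool(tool_call["tool"], tool_call["arguments"])
    # The result goes back as plain text; the model has no variables in which to accumulate a sum.
    messages.append({"role": "tool", "content": str(result)})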
Code Agent: the agent uses the LLM's innate coding ability to write executable Python code directly.
temperature_sum = 0
for city in ["Paris", "Tokyo", "New York"]:
    temp = get_weather(city)
    temperature_sum += temp

print(f"Average temperature: {temperature_sum / 3:.1f}°C")
This shift, first presented in the paper “Executable Code Actions Elicit a Better LLM Agent”, gave AI agents the flexibility to write arbitrary executable Python code in addition to tool calls.
The key insight is that tools are called directly from within the code, which makes variables and state management natural. Agents can call tools inside loops, functions, and conditionals, essentially generating a dynamic graph of tool executions with each action!
Benefits of the CodeAgent approach:
Dynamic tool use: agents decide which tools to call based on the current context.
Unlimited flexibility: any Python construct can be used to achieve the goal.
Ability to test hypotheses: agents can form and test hypotheses, making their behavior more adaptive.
However, parsing code out of Markdown can be error-prone, which raises a question: what if we used structured generation to produce code actions?
Adding Structured Outputs to the Code Agent
Structured outputs let us force the LLM to generate an explicit thought along with its code as a JSON blob:
{
  "thought": "I need to find the average temperature across three cities.",
  "code": "temperature_sum = 0\nfor city in [\"Paris\", \"Tokyo\", \"New York\"]:\n    temp = get_weather(city)\n    temperature_sum += temp\n\nprint(f\"Average temperature: {temperature_sum / 3:.1f}°C\")"
}
The important difference is that the generation is constrained: the model is not just encouraged to output its thoughts, it is forced by structured outputs to respect this structure.
This approach gets the best of both worlds, adding the reliability of structured generation to the flexibility of code execution:
Explicit reasoning: the thought field forces the agent to reason before acting.
Reliable parsing: the JSON structure eliminates Markdown parsing errors.
Full code expressiveness: the code field keeps all the flexibility of code agents.
Better separation: planning and execution are clearly separated.
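Concretely, the constraint can be expressed as a JSON Schema that the inference backend enforces during decoding. Below is a minimal sketch of such a schema and how it might be passed to an OpenAI-compatible client; the exact schema smolagents uses internally may differ.

# Minimal sketch: a schema that forces a "thought" string and a "code" string.
# Illustrative only; not necessarily the exact schema smolagents uses.
CODE_ACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "thought": {"type": "string"},
        "code": {"type": "string"},
    },
    "required": ["thought", "code"],
    "additionalProperties": False,
}

# With an OpenAI-compatible API, the schema can be enforced at decoding time
# through the response_format parameter, so malformed outputs cannot occur.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "code_action", "schema": CODE_ACTION_SCHEMA, "strict": True},
}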
🧪 Benchmark Results
We compared these three paradigms across multiple benchmarks: GAIA, MATH, SimpleQA, and Frames. The results show a clear pattern: code actions plus structured generation consistently improve performance for capable models.
For the most capable models, the structured approach consistently outperformed the regular CodeAgent approach, by 2-7 percentage points on average.
OpenAI models: show the largest improvements with structure, especially on reasoning-heavy tasks.
Claude models: also benefit from structure, with Claude 3.7 Sonnet showing particularly strong results.
Why Structure (Generally) Helps
The parsing problem is real
The CodeAgent implementation in smolagents extracts Python code from the LLM's Markdown output.
This extraction fails when the Markdown is malformed: multiple code blocks in a single response, incomplete blocks, or incorrectly formatted fences.
Structured generation eliminates these problems with reliable JSON parsing.
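As a rough illustration of the difference, here is a simplified sketch of the two parsing paths (this is not the actual smolagents parser):

import json
import re

def extract_code_from_markdown(text: str) -> str:
    # Fragile: assumes the model emitted exactly one well-formed ```python fence.
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    if match is None:
        raise ValueError("no code block found")  # a common first-step failure mode
    return match.group(1)

def extract_code_from_json(text: str) -> str:
    # With structured outputs, the response is guaranteed to be valid JSON
    # containing a "code" field, so a single json.loads call is enough.
    return json.loads(text)["code"]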
To understand why structured generation matters, we analyzed 15,724 agent traces across our benchmarks. The results are striking:
2.4% of traces had a parsing error on the first call.
Traces with a first-call parsing error: 42.3% success rate.
Traces without a first-call parsing error: 51.3% success rate.
Agent traces without parsing errors succeed 21.3% more often (in relative terms) than traces with parsing errors.
This is not just a matter of convenience: parsing errors create a cascade of failures that significantly hurts overall agent performance. If an agent cannot execute its first action because of malformed code, it often struggles to recover and ends up on a suboptimal problem-solving path.
Figure 2: A parsing error at the first step reduces the success rate by 21.3% and increases the average number of steps taken from 3.18 to 4.63.
Bonus: the benefits of forced reasoning
With structured generation, explicit thinking is not just encouraged but enforced: agents must articulate their reasoning before acting. This brings:
Better planning: agents think through the problem more systematically.
Improved reliability: explicit reasoning catches logic errors early.
The structure tax
Our results also reveal a clear capability threshold: a model needs sufficient instruction-following ability and enough JSON coverage in its pre-training data to benefit from structured generation. This suggests the structured approach works best with:
Large, well-trained models.
Models with strong instruction-following abilities.
Models fine-tuned on structured generation.
When structure breaks: a real example
When a small model (mistralai/Mistral-7B-Instruct-v0.3) tries to generate structured code, the cognitive load becomes too high. Here is what happens:
{
  "thought": "I need to find the height...",
  "code": "web_search(query=\"eiffel tower height\")\", "
}
This model produces syntactically invalid Python: web_search(query="eiffel tower height")", with a stray quote and comma appended to the call.
This illustrates a "structure tax": small models struggle to simultaneously handle JSON formatting, Python syntax, and the actual problem-solving logic. The cognitive overhead of structured generation can overwhelm a model that would perform reasonably well with simpler Markdown-based code generation.
When to Use Structured CodeAgents
✅ Use it when:
You are using a capable model (32B+ parameters, or a frontier model).
The task requires complex reasoning combined with code execution.
You need reliable parsing of agent outputs.
⚠️ Consider alternatives when:
You are using smaller models that struggle with structured generation.
A simple, predefined workflow is sufficient.
How to use it with smolagents
It's very simple! Just enable it with use_structured_outputs_internally:
from smolagents import CodeAgent, GoogleSearchTool, InferenceClientModel

agent = CodeAgent(
    tools=[GoogleSearchTool(provider="serper")],
    model=InferenceClientModel("Qwen/Qwen3-235B-A22B", provider="nebius"),
    use_structured_outputs_internally=True,
)

result = agent.run("Calculate how long a cheetah would take to run across the Golden Gate Bridge.")
The LLM then generates something like this:
{
  "thought": "I need to find the length of the Golden Gate Bridge and the top speed of a cheetah, then calculate the time.",
  "code": "bridge_info = web_search('Golden Gate Bridge length meters')\ncheetah_speed = web_search('cheetah top speed')\n..."
}
The "code" part is then executed by the agent as usual. It's still a standard CodeAgent, but parsing is now 100% reliable!
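Conceptually, the agent's inner step then looks something like the sketch below. This is a simplification, not the actual smolagents implementation, and python_executor stands in for whatever sandboxed executor you use:

import json

def run_one_step(llm_output: str, python_executor) -> str:
    # With structured outputs, llm_output is guaranteed to be valid JSON
    # with "thought" and "code" fields, so no Markdown parsing is needed.
    action = json.loads(llm_output)
    print("Thought:", action["thought"])      # the explicit reasoning, useful for logging
    return python_executor(action["code"])    # the code runs exactly as in a regular CodeAgent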
Implementation Tips
Clear prompting: make sure your prompt clearly specifies the JSON structure you expect.
Model selection: choose a model with strong structured generation capabilities.
Provider selection: if you are using inference providers on Hugging Face, support for structured generation varies from provider to provider, so check the list of providers that support it.
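For example, a system prompt fragment along these lines (the wording is illustrative, not the exact smolagents prompt) makes the expected structure explicit:

Respond with a single JSON object containing exactly two keys:
"thought": a short explanation of what you plan to do next, and
"code": the Python code to execute, as a plain string without Markdown fences.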
The Big Picture – What’s next?
This study suggests we are moving toward a more nuanced understanding of agent architectures. The question is no longer just "What can the agent do?" but "How does the agent think about what it is doing?"
Perhaps making the reasoning process explicit helps the model stay on track. Or perhaps it simply makes the output easier to parse. Either way, it's a win.
But this is just the beginning. There are so many questions left to explore:
What other structural improvements could be useful?
How does this approach transfer across model architectures, especially smaller models?
What does this tell us about the nature of AI reasoning?
For now, if you're using smolagents (or building your own CodeAgent system), consider trying structured outputs. Your parsing errors will thank you, and you may well see a nice boost in performance!