AI agents often perform well in controlled research settings, but struggle when deployed in real-world systems because they must reason across multiple steps, interact with real-world tools and APIs, operate on partial information, and recover from errors in a stateful, permissioned environment. This highlights the continuing gap between research success and operational reliability.
OpenEnv is an open source framework developed by Meta and Hugging Face that is designed to address this challenge by standardizing the way agents interact with their real-world environments. As part of this collaboration, Turing provided an operational-grade calendar management environment to study agents using tools under realistic constraints such as access control, temporal reasoning, and multi-agent coordination.
In this post, we explore how OpenEnv works in practice, why the calendar serves as a powerful benchmark for real-world agent evaluation, and what our findings reveal about the current limitations of tool-using agents.
What is OpenEnv?
OpenEnv is a framework for evaluating AI agents against real systems rather than simulations. It provides a standardized way to connect agents to real-world tools and workflows while maintaining the structure necessary for consistent and reliable assessments.
OpenEnv uses Gym-style APIs in the spirit of OpenAI Gym and Gymnasium (reset, step, action, observe). It also connects to environments through standard MCP tool-invocation interfaces, providing a consistent interface to production environments across domains and simulations.
An OpenEnv environment maintains state across multiple actions, supports long-horizon reasoning, and can connect directly to real APIs and tools such as browsers, code repositories, and calendars. This shifts the evaluation question from “Does it work in a controlled demo?” to “Will it work reliably in the real world?”
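Concretely, an evaluation run looks like a standard gym loop. Here is a minimal sketch, assuming the MCPEnvClient interface used in the Calendar Gym example later in this post; the agent policy (agent.choose_action, agent.is_finished) is hypothetical.

# Sketch of the gym-style loop from the agent's perspective.
# MCPEnvClient/MCPAction follow the Calendar Gym example below; `agent` is hypothetical.
from openenv_wrapper.client import MCPEnvClient
from openenv_wrapper.data_models import MCPAction

with MCPEnvClient.from_hub(base_url="TuringEnterprises/Calendar Gym") as client:
    observation = client.reset().observation       # reset: start an isolated, stateful session
    done = False
    while not done:
        action = agent.choose_action(observation)  # agent decides which tool to call next
        result = client.step(action)               # step: execute the action in the environment
        observation = result.observation           # observe: updated state, tool results, errors
        done = agent.is_finished(observation)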
Calendar Gym: a production-grade benchmark
The calendar system is deceptively complex. Scheduling a meeting may seem simple, but real-world calendar management requires agents to consider time, permissions, multiple users, and incomplete information, and often requires multiple dependent steps. These properties make the calendar a powerful testbed for evaluating tool-using agents outside of controlled simulations.
To ground OpenEnv in these kinds of real-world, demanding use cases, Turing built a production-grade calendar management environment called Calendar Gym. Rather than simulating abstract scheduling, we expose agents to the same constraints they face in real calendar systems: access control lists across users and calendars, limited visibility into other users’ status, and multi-step workflows that require actions to be chained together in the correct order. Agents must navigate a rich set of calendar operations, from listing calendars to changing events and permissions, and must handle failed actions, incorrect assumptions, and missing permissions. Each session runs in an isolated environment, allowing reliable comparisons between runs.
Below is a code example of how to use Calendar Gym: connect to the environment, discover the available tools, list calendars, create an event, and print the results.
from openenv_wrapper.client import MCPEnvClient
from openenv_wrapper.data_models import MCPAction

with MCPEnvClient.from_hub(base_url="TuringEnterprises/Calendar Gym") as client:
    # Start a fresh, isolated session
    result = client.reset()
    print("Reset successful:", result.observation.success)

    # Discover the tools exposed by the environment
    result = client.step(MCPAction(action_type="list_tools"))
    print("Available tools:", len(result.observation.tools_list))

    # List the calendars visible to the current user
    result = client.step(MCPAction(
        action_type="tool_call",
        tool_name="calendars_list",
        arguments={}
    ))
    calendars = result.observation.tool_result["items"]
    print("Calendars:", calendars)

    # Create an event on the primary calendar
    result = client.step(MCPAction(
        action_type="tool_call",
        tool_name="events_insert",
        arguments={
            "calendarId": "primary",
            "summary": "Team sync",
            "start": {"dateTime": "2026-01-15T14:00:00Z"},
            "end": {"dateTime": "2026-01-15T15:00:00Z"}
        }
    ))
    print("Event created:", result.observation.success)
Below is an excerpt of what Calendar Gym returns for the list-tools action above. Each entry includes the tool name and an input schema describing the arguments the tool accepts.
{
  "tools": [
    {
      "name": "calendars_list",
      "description": "List the calendars visible to the current user.",
      "inputSchema": {
        "type": "object",
        "properties": {},
        "additionalProperties": false
      }
    },
    {
      "name": "events_insert",
      "description": "Create an event on your calendar.",
      "inputSchema": {
        "type": "object",
        "properties": {
          "calendarId": { "type": "string" },
          "summary": { "type": "string" },
          "start": {
            "type": "object",
            "properties": { "dateTime": { "type": "string" } },
            "required": ["dateTime"]
          },
          "end": {
            "type": "object",
            "properties": { "dateTime": { "type": "string" } },
            "required": ["dateTime"]
          }
        },
        "required": ["calendarId", "summary", "start", "end"]
      }
    }
  ]
}
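The same observation structure reports failures: each step carries a success flag, and failed tool calls return a structured error payload (the common error shapes are covered in the appendix). Here is a minimal handling sketch, reusing the client and MCPAction import from the example above; the exact error fields are an assumption based on the appendix payloads.

# Sketch of handling a failed tool call; reuses MCPAction from the example above.
# The error payload fields are assumptions based on the appendix examples.
result = client.step(MCPAction(
    action_type="tool_call",
    tool_name="events_insert",
    arguments={"summary": "Missing calendarId, start, and end"}  # deliberately incomplete
))
if not result.observation.success:
    error = result.observation.tool_result   # structured error payload
    print("Tool call failed:", error.get("message"))
    # An agent loop would repair the arguments from error["details"] and retry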
What we learned
Evaluating agents on Calendar Gym revealed patterns that are consistent across domains. Agents often handle individual actions well, but become less reliable as tasks grow longer, more ambiguous, and more constrained.
Multi-step reasoning is the main bottleneck. Agents struggle to chain actions correctly over longer workflows, which suggests that reasoning needs to be evaluated across multiple dependent steps, not just on a single tool call.
Ambiguity significantly degrades performance. Agents achieved nearly a 90% success rate on tasks that used explicit calendar identifiers, but success dropped to about 40% when the same tasks were expressed as natural-language descriptions. Rather than relying on the LLM alone to resolve references, it seems essential to build explicit search and validation into the agent loop.
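One way to close that gap is to resolve references explicitly before acting on them: search the user's calendars for candidates that match the description and proceed only when there is exactly one match. A minimal sketch, reusing the client from the example above; the item fields ("id", "summary") and the substring match are assumptions for illustration.

# Hypothetical resolve-then-validate step: map a natural-language calendar
# description to an explicit calendarId before any write.
def resolve_calendar_id(client, description):
    result = client.step(MCPAction(
        action_type="tool_call",
        tool_name="calendars_list",
        arguments={}
    ))
    items = result.observation.tool_result["items"]
    # Simple substring match; a production agent might use fuzzy or semantic search
    matches = [c for c in items if description.lower() in c["summary"].lower()]
    if len(matches) != 1:
        raise ValueError(f"{description!r} matched {len(matches)} calendars; ask the user to clarify")
    return matches[0]["id"]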
Choosing the right tool is not enough. Among failed interactions, more than half of the errors were due to malformed tool arguments or incorrect step ordering, even when the correct tool was selected. Reliable agent behavior depends not only on tool selection but also on execution quality and structured feedback, which makes environment design key.
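A related mitigation is to validate arguments against the tool's declared inputSchema before sending the call, so malformed arguments fail fast with a clear local error. A minimal sketch using the jsonschema library, assuming the entries in tools_list are shaped like the list-tools excerpt above and that inputSchema is standard JSON Schema.

# Pre-validate tool arguments against the declared inputSchema before calling.
# Assumes tools_list entries look like the list-tools excerpt above.
import jsonschema

def validated_tool_call(client, tools_list, tool_name, arguments):
    tools_by_name = {t["name"]: t for t in tools_list}
    schema = tools_by_name[tool_name]["inputSchema"]
    try:
        jsonschema.validate(instance=arguments, schema=schema)
    except jsonschema.exceptions.ValidationError as exc:
        # Fail locally with a precise message instead of a malformed round trip
        raise ValueError(f"Arguments for {tool_name} do not match the schema: {exc.message}")
    return client.step(MCPAction(
        action_type="tool_call",
        tool_name=tool_name,
        arguments=arguments
    ))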
These challenges are not unique to scheduling and calendars. They reflect broader limitations that emerge whenever agents operate in systems with state that changes over time, permissions, and partial observability, and they point to the need for evaluation frameworks that test multi-step workflows end to end.
Looking ahead
OpenEnv provides a foundation for testing agents under realistic conditions, and Calendar Gym shows how seemingly simple domains can surface deep challenges in reasoning, ambiguity resolution, and tool usage. Evaluating agents with measurable failures and real-world constraints provides clearer insight into what is needed to build agents that work reliably in production.
To learn more about Calendar Gym’s design, benchmark methodology, and quantitative results, check out the full technical article on Turing’s site. To explore Calendar Gym yourself, visit the Calendar Gym space.
Appendix: Common error cases when using tools
In practice, tool integration rarely fails dramatically; it fails in small, predictable ways. We encountered several recurring issues when connecting MCP tools to real APIs such as calendar operations.
Specific error cases found in practice
Below are three common failure modes seen in production, along with typical error payloads and mitigation strategies. These examples show not only what can go wrong, but also how structured errors can help the agent recover successfully.
1. Schema validation error (missing or malformed argument)
The agent calls a valid tool (such as events_insert), but the arguments do not match the declared JSON schema.
Common causes include:
- Required fields such as calendarId are missing
- Incorrect nesting of start/end
- Passing a string where an object is expected

Example error payload:
{
  "success": false,
  "error_type": "validation_error",
  "tool_name": "events_insert",
  "message": "Invalid arguments for tool 'events_insert'.",
  "details": {
    "missing_required_fields": ["calendarId", "end"],
    "invalid_fields": [
      {
        "field": "start",
        "expected_type": "object",
        "received_type": "string"
      }
    ]
  }
}
You can mitigate this by providing one canonical example of a correct events_insert call in the prompt, and by returning structured validation errors so the model can repair the call and retry rather than failing silently.
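A minimal sketch of that repair-and-retry loop, assuming the error payload shape shown above and a hypothetical repair_arguments helper that asks the model to propose corrected arguments from the structured details:

# Sketch of a repair-and-retry loop driven by structured validation errors.
# Assumes the payload shape above; repair_arguments() is a hypothetical helper.
def call_with_repair(client, tool_name, arguments, max_attempts=3):
    for _ in range(max_attempts):
        result = client.step(MCPAction(
            action_type="tool_call",
            tool_name=tool_name,
            arguments=arguments
        ))
        payload = result.observation.tool_result
        if result.observation.success:
            return payload
        if payload.get("error_type") != "validation_error":
            break  # not repairable here (e.g. a permission error; see below)
        arguments = repair_arguments(tool_name, arguments, payload.get("details", {}))
    raise RuntimeError(f"{tool_name} still failing after {max_attempts} attempts")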
2. Permission/Authorization Error (401/403)
The tool call is syntactically correct, but the API rejects it due to insufficient privileges.
Common causes include:
- The OAuth scope is missing
- The access token has expired
- The user does not have write access to the target calendar

Example error payload:
{
  "success": false,
  "error_type": "permission_error",
  "tool_name": "events_insert",
  "http_status": 403,
  "message": "Authenticated user does not have write access to calendar 'primary'.",
  "remediation": [
    "Make sure the OAuth token includes a calendar write scope.",
    "Make sure the user has edit access to the calendar.",
    "If the token has expired, reconnect the integration."
  ]
}
This can be mitigated by clearly documenting the required OAuth scopes and by returning structured, actionable remediation steps so that agents can guide users instead of retrying the same failed call.
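In an agent loop, this error class should be routed differently from validation errors: surface the remediation steps to the user instead of retrying. A minimal sketch, assuming the payload shape shown above:

# Sketch: route permission errors to the user instead of retrying.
# Assumes the structured payload shape shown above.
def handle_tool_error(payload):
    if payload.get("error_type") == "permission_error":
        steps = "\n".join(f"- {step}" for step in payload.get("remediation", []))
        return f"I don't have permission to do that. To fix it:\n{steps}"
    if payload.get("error_type") == "validation_error":
        return None  # repairable: hand back to the repair-and-retry loop above
    return f"Tool call failed: {payload.get('message', 'unknown error')}"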
3. Date/time format errors (RFC3339 and time zone issues)
The API rejects the event, or the event is created at an unexpected time.
Common causes include:
- Missing time zone offset
- Non-RFC3339 date/time format
- Incorrect nesting of start.dateTime or end.dateTime
- Mixing local time and UTC without specifying an offset

Example error payload:
{
  "success": false,
  "error_type": "format_error",
  "tool_name": "events_insert",
  "message": "The date/time format for field 'start.dateTime' is invalid.",
  "details": {
    "received": "February 11, 2026, 9:30 a.m.",
    "expected_format": "RFC3339 (e.g. 2026-02-11T09:30:00-05:00)"
  }
}
This can be mitigated by standardizing on RFC3339 with an explicit time zone offset (e.g. 2026-02-11T09:30:00-05:00) and by including at least one correct date/time example in the tool documentation to anchor model behavior and reduce repair retries.
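A small normalization helper can enforce this before any event is sent. A minimal sketch using only the Python standard library; the accepted input format and default time zone here are assumptions for illustration.

# Normalize a datetime string to RFC3339 with an explicit offset before calling events_insert.
from datetime import datetime
from zoneinfo import ZoneInfo

def to_rfc3339(value, tz="America/New_York"):
    # Accept "YYYY-MM-DD HH:MM" input; anything else should be rejected or re-asked upstream
    dt = datetime.strptime(value, "%Y-%m-%d %H:%M").replace(tzinfo=ZoneInfo(tz))
    return dt.isoformat()  # e.g. "2026-02-11T09:30:00-05:00"

start = {"dateTime": to_rfc3339("2026-02-11 09:30")}
end = {"dateTime": to_rfc3339("2026-02-11 10:00")}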

