Open Responses is a new open inference standard. Started by OpenAI, built by the open source AI community, and supported by the Hugging Face ecosystem, Open Responses is built on the Responses API and designed for the future of agents. This blog post explains how Open Responses works and why the open source community should adopt it.
The era of chatbots is over, and agents now dominate inference workloads. Developers are moving toward autonomous systems that reason, plan, and execute over long horizons. Despite this shift, much of the ecosystem still relies on the Chat Completions format, which was designed for turn-based conversations and falls short for agent use cases. Although the Responses API was designed to address these limitations, it is proprietary and has not been widely adopted. Chat Completions remains the de facto standard, despite the alternatives.
This mismatch between agent workflow requirements and established interfaces has created a need for open inference standards. Over the coming months, we will work with the community and inference providers to implement and adopt Open Responses as a shared format that can effectively replace Chat Completions.
Open Responses builds on the direction OpenAI set with the Responses API, released in March 2025, which replaced the existing Completions and Assistants APIs with a single consistent interface that can:
Produce text, images, and structured JSON output.
Create video content through a companion task-based endpoint.
Run an agent loop on the provider side, autonomously executing tool calls and returning the final results.
What is Open Responses?
Open Responses extends and open-sources the Responses API, allowing builders, inference providers, and routers to interoperate and collaborate around a common standard.
The key points are:
It is stateless by default and supports encrypted reasoning content for providers that require it.
Model configuration parameters are standardized.
Streaming is modeled as a series of semantic events rather than raw text or object deltas.
It is extensible via provider-specific configuration parameters.
What do I need to know to build with Open Responses?
Here’s a quick overview of the major changes that will affect most community members. If you want to know more about the specs, please see the Open Responses documentation.
Client requests to Open Responses
Client requests to Open Responses are similar to the existing Responses API. Below is a request to the Open Responses API using curl; it calls a proxy endpoint that routes to an inference provider using the Open Responses API schema.
curl https://evalstate-openresponses.hf.space/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $HF_TOKEN" \
  -H "OpenResponses-Version: Latest" \
  -N \
  -d '{
    "model": "Moonshotai/Kimi-K2-Thinking:nebius",
    "input": "Explanation of life theory"
  }'
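For Python clients, the same request can be sent with the requests library. This is a minimal sketch, assuming only the endpoint, headers, and body shown in the curl example above; it is not an official client.

import os
import requests

# Same request as the curl example above, sent from Python.
response = requests.post(
    "https://evalstate-openresponses.hf.space/v1/responses",
    headers={
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "OpenResponses-Version": "Latest",
    },
    json={
        "model": "Moonshotai/Kimi-K2-Thinking:nebius",
        "input": "Explanation of life theory",
    },
)
print(response.json())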
Inference client and provider changes
Clients that already support the Responses API can migrate to Open Responses with relatively little effort. Key changes include how reasoning content is published.
Enhanced reasoning visibility: Open Responses formalizes three optional fields on reasoning items: content (the raw reasoning trace), encrypted_content (provider-specific protected content), and summary (a sanitized summary of the raw trace).
OpenAI models have historically exposed only summaries and encrypted content. Open Responses allows providers to expose raw reasoning through the API. Clients migrating from providers that previously returned only summaries and encrypted content will now be able to receive and process raw reasoning streams if their chosen provider supports it, as sketched below.
Richer state-change and payload definitions, including more granular observability. For example, a hosted code interpreter can emit execution-state events to improve agent and user visibility during long-running operations.
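To make the reasoning fields concrete, here is a hedged sketch of how a client might read a reasoning item carrying the three fields described above. Only the field names content, encrypted_content, and summary come from this post; the surrounding item structure is an assumption, not the normative spec.

# Hypothetical reasoning item; only the three field names are taken from the text above.
reasoning_item = {
    "type": "reasoning",
    "content": [{"type": "reasoning_text", "text": "Step 1: parse the question..."}],
    "encrypted_content": None,  # provider-specific protected payload, if present
    "summary": [{"type": "summary_text", "text": "Parsed the user's question."}],
}

# Prefer the raw trace when the provider exposes it; otherwise fall back to the summary.
trace = reasoning_item.get("content") or reasoning_item.get("summary")
print(trace)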
If your model provider is already compliant with the Responses API specification, implementing the Open Responses changes is straightforward. Routers have the opportunity to standardize on consistent endpoints while supporting provider-specific configuration options where needed.
Over time, as providers continue to innovate, certain features will become standardized in the base spec.
In summary, moving to Open Responses makes the inference experience more consistent and improves quality by normalizing the undocumented extensions, interpretations, and workarounds that accumulated around the legacy Completions API.
The request below shows how to stream reasoning chunks.
{
  "model": "Moonshotai/Kimi-K2-Thinking:together",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": "Please explain photosynthesis."
    }
  ],
  "stream": true
}
Here is the difference between receiving a raw reasoning delta with Open Responses and a summary delta with the OpenAI Responses API:
event: response.reasoning.delta
data: {"delta": "User asks: 'Where should I eat…' Step 1: Parse the location…", …}

event: response.reasoning_summary_text.delta
data: {"delta": "Determined the user wants restaurant recommendations.", …}
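To show how a client might consume these semantic events, here is a hedged Python sketch that replays the streaming request above and dispatches on the two event names. Only the event names and the "event:"/"data:" line pairs come from the snippets above; everything else (endpoint, headers, defensive handling of non-JSON data lines) is an assumption.

import json
import os
import requests

# Stream the earlier request and handle the two reasoning event types.
with requests.post(
    "https://evalstate-openresponses.hf.space/v1/responses",
    headers={
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "OpenResponses-Version": "Latest",
    },
    json={
        "model": "Moonshotai/Kimi-K2-Thinking:together",
        "input": [{"type": "message", "role": "user", "content": "Please explain photosynthesis."}],
        "stream": True,
    },
    stream=True,
) as resp:
    event = None
    for line in resp.iter_lines(decode_unicode=True):
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            chunk = line.split(":", 1)[1].strip()
            if not chunk.startswith("{"):
                continue  # skip keep-alives or stream terminators
            payload = json.loads(chunk)
            if event == "response.reasoning.delta":
                print("raw reasoning delta:", payload.get("delta"))
            elif event == "response.reasoning_summary_text.delta":
                print("summary delta:", payload.get("delta"))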
Open Responses for routing
Open Responses distinguishes between “model providers”, which serve inference, and “routers”, intermediaries that coordinate requests across multiple providers.
Clients will now be able to specify a provider when making a request, along with provider-specific API options, allowing intermediate routers to coordinate requests between upstream providers.
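As a sketch of what this can look like in practice: in the examples in this post, the upstream provider is selected with a suffix on the model name (":nebius", ":together"). The helper below is hypothetical and only illustrates how a router might read that hint; the spec's actual provider-selection fields may differ.

# Hypothetical helper showing how a router could read a ':provider' suffix on the
# model name to pick an upstream provider; not part of the Open Responses spec.
def split_model_and_provider(model: str, default: str = "auto") -> tuple[str, str]:
    if ":" in model:
        name, provider = model.rsplit(":", 1)
        return name, provider
    return model, default

print(split_model_and_provider("Moonshotai/Kimi-K2-Thinking:nebius"))
# ('Moonshotai/Kimi-K2-Thinking', 'nebius')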
Tools
Open Responses natively supports two categories of tools: internal and external. Externally hosted tools are implemented outside of the model provider’s system; examples include client-side function calls and MCP servers. Internally hosted tools reside within the model provider’s system; examples include OpenAI file search and Google Drive integration. The model calls them, they execute, and results are retrieved entirely within the provider’s infrastructure, with no developer intervention required.
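As a rough sketch of how the two categories appear in a request, the snippet below declares one external, client-side function tool and one internally hosted tool. The function-tool shape follows common Responses API conventions; the internally hosted tool's type name and configuration are assumptions and will vary by provider.

# Hedged sketch of a request mixing external and internal tools; exact field
# names for hosted tools are provider-defined and assumed here.
request_body = {
    "model": "Moonshotai/Kimi-K2-Thinking:nebius",
    "input": "Find the latest sales figures.",
    "tools": [
        {
            # External: a client-side function the developer executes.
            "type": "function",
            "name": "get_sales_figures",
            "description": "Return sales figures for a given quarter.",
            "parameters": {
                "type": "object",
                "properties": {"quarter": {"type": "string"}},
                "required": ["quarter"],
            },
        },
        {
            # Internal: a tool hosted and executed inside the provider's system.
            "type": "file_search",
        },
    ],
}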
Subagent loops
Open Responses formalizes agent loops, which typically consist of repeated cycles of inference, tool invocation, and response generation, allowing models to autonomously complete multi-step tasks.
Image source: openresponses.org
The loop works like this:
1. The API receives the user request and samples from the model.
2. When the model issues a tool call, the API executes it (internal or external).
3. The tool results are fed back to the model for continued inference.
4. The loop repeats until the model signals completion.
For internally hosted tools, the provider manages the entire loop: it runs the tool, returns the results to the model, and streams the output. This means a single request can drive a multi-step workflow such as “search documents, summarize results, draft an email.”
The client controls the behavior of the loop with max_tool_calls to limit iterations and tool_choice to restrict which tools can be called.
{
  "model": "zai-org/GLM-4.7",
  "input": "Find Q3 sales data and email a summary to your team.",
  "tools": [ … ],
  "max_tool_calls": 5,
  "tool_choice": "auto"
}
The response includes all intermediate items, such as tool calls, tool results, and reasoning.
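To illustrate, here is a hedged sketch of walking those intermediate items in a completed response body. It assumes the response exposes an output list of typed items, as in the Responses API; the exact type names are illustrative and may differ.

# Hedged sketch: iterate over intermediate items in a finished response body.
# Assumes an "output" list of typed items; type names are illustrative.
def summarize_output(response_body: dict) -> None:
    for item in response_body.get("output", []):
        item_type = item.get("type")
        if item_type == "reasoning":
            print("reasoning item")
        elif item_type in ("function_call", "tool_call"):
            print("tool call:", item.get("name"))
        elif item_type == "message":
            print("final message item")
        else:
            print("other item:", item_type)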
Next steps
Open Responses extends and improves the Responses API with richer and more detailed content definitions, compatibility, and deployment options. It also provides a standard way to run subagent loops within a single inference call, enabling powerful capabilities in AI applications. We look forward to working with the Open Responses team and the wider community to develop the specification further.

You can try Open Responses with the Hugging Face Inference provider today. An early-access version is available on Hugging Face Spaces. Try it with your client and our Open Responses compliance tools.

