Earlier this year, we said we would bring computer use capabilities to developers through the Gemini API. Today we are releasing the Gemini 2.5 Computer Use model, a new specialized model built on Gemini 2.5 Pro's visual understanding and reasoning capabilities that powers agents able to interact with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities through the Gemini API in Google AI Studio and Vertex AI.
Although AI models can interact with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, such as filling out and submitting forms. To complete these tasks, agents must interact with web pages and applications the way humans do: by clicking, typing, and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is an important next step in building powerful, general-purpose agents.
How it works
The model's core capabilities are exposed through the new `computer_use` tool in the Gemini API and should be operated within a loop. Inputs to the tool are the user request, a screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions, or to include additional custom functions.
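To make the loop concrete, here is a minimal sketch in Python using the google-genai SDK. The `ComputerUse` tool type, the `ENVIRONMENT_BROWSER` enum value, and the preview model name are taken from the preview documentation and may differ in your SDK version; `take_screenshot` and `execute_action` are hypothetical helpers standing in for your own browser automation (for example, Playwright), and the exact function-response format should be checked against the current docs.

```python
# A minimal sketch of the agent loop, assuming the google-genai Python SDK.
# take_screenshot() and execute_action() are hypothetical placeholders for
# your own browser-automation code; they are not part of the SDK.
from google import genai
from google.genai import types

client = genai.Client()

# Declare the computer_use tool; the environment enum tells the model it is
# controlling a browser. Excluded predefined actions or custom functions
# would also be configured here.
config = types.GenerateContentConfig(
    tools=[types.Tool(
        computer_use=types.ComputerUse(
            environment=types.Environment.ENVIRONMENT_BROWSER,
        )
    )],
)

# Start with the user request plus a screenshot of the current environment.
contents = [
    types.Content(role="user", parts=[
        types.Part(text="Find the cheapest flight from Boston to Denver."),
        types.Part.from_bytes(data=take_screenshot(), mime_type="image/png"),
    ]),
]

while True:
    response = client.models.generate_content(
        model="gemini-2.5-computer-use-preview-10-2025",  # preview name; verify against current docs
        contents=contents,
        config=config,
    )
    candidate = response.candidates[0]
    contents.append(candidate.content)

    # Collect the UI actions the model proposed (e.g. click or type actions).
    function_calls = [p.function_call for p in candidate.content.parts if p.function_call]
    if not function_calls:
        break  # No more actions: the model has finished or answered in text.

    # Execute each action in the real environment, then send the result back
    # with a fresh screenshot so the model can decide its next step.
    for call in function_calls:
        execute_action(call.name, call.args)  # hypothetical: drives the browser
        contents.append(types.Content(role="user", parts=[
            types.Part.from_function_response(
                name=call.name,
                response={"url": "about:blank"},  # placeholder for current page state
            ),
            types.Part.from_bytes(data=take_screenshot(), mime_type="image/png"),
        ]))
```

The key design point is that the model never touches the environment directly: it only proposes actions, and the client code executes them and reports back with an updated screenshot, which is why the tool must be driven inside a loop.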

