Earlier this year, we said we were bringing computer use capabilities to developers via the Gemini API. Today, we are releasing the Gemini 2.5 Computer Use model, a new specialized model built on the visual understanding and reasoning capabilities of Gemini 2.5 Pro that powers agents able to interact with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities via the Gemini API in Google AI Studio and Vertex AI.
While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example filling in and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing, and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a critical next step in building powerful, general-purpose agents.
How it works
The model's core capabilities are exposed through the new `computer_use` tool in the Gemini API and should be operated within a loop. Inputs to the tool are the user's request, a screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions, or to specify additional custom functions to include.
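To make the loop concrete, here is a minimal sketch in Python using the `google-genai` SDK. The exact model id, the `ComputerUse`/`Environment` configuration fields, and the shape of the function response are taken from the preview documentation and may change; `take_screenshot()`, `execute_action()`, and `current_url()` are hypothetical placeholders for your own environment code (for example, a Playwright-driven browser).

```python
# A minimal sketch of the computer-use agent loop, assuming the google-genai
# Python SDK and the preview model id at launch. take_screenshot(),
# execute_action(), and current_url() are hypothetical helpers you implement
# against your own browser environment.
from google import genai
from google.genai import types

client = genai.Client()

config = types.GenerateContentConfig(
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER,
        # Optionally trim the built-in action space (field name per the
        # preview docs; treat as an assumption):
        # excluded_predefined_functions=["drag_and_drop"],
    ))],
)

# Seed the loop with the user request and an initial screenshot.
contents = [types.Content(role="user", parts=[
    types.Part(text="Find the contact form on example.com and fill in my name."),
    types.Part.from_bytes(data=take_screenshot(), mime_type="image/png"),
])]

while True:
    response = client.models.generate_content(
        model="gemini-2.5-computer-use-preview-10-2025",  # preview id at launch
        contents=contents,
        config=config,
    )
    candidate = response.candidates[0]
    contents.append(candidate.content)  # keep the action history in context

    function_calls = [p.function_call for p in candidate.content.parts
                      if p.function_call]
    if not function_calls:
        print(response.text)  # no more UI actions: the model answered or asked
        break

    for call in function_calls:
        execute_action(call.name, call.args)  # e.g. click_at, type_text_at

    # Report the outcome plus a fresh screenshot so the loop can continue.
    contents.append(types.Content(role="user", parts=[
        types.Part(function_response=types.FunctionResponse(
            name=function_calls[-1].name,
            response={"url": current_url()},
        )),
        types.Part.from_bytes(data=take_screenshot(), mime_type="image/png"),
    ]))
```

The loop terminates when the model stops emitting function calls, which is also the natural point to surface its final text response to the user.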

