TL;DR
For the last few weeks, we have been hard at work making GUI agents more open, accessible, and integrated. Along the way, we built the largest benchmark suite for GUI agent performance: let us introduce ScreenSuite.
Today we are very excited to share it with you. ScreenSuite is the most comprehensive and easiest way to evaluate Vision-Language Models (VLMs) across a wide range of agentic capabilities!
What the heck is a GUI agent?
A GUI agent in action – courtesy of OSWorld
In short, AI agents are robots that act in a virtual world. (Find a more thorough definition here.)
In particular, a “GUI agent” is an agent that lives in a GUI. Think of Claude Computer Use: an agent that you can task to navigate your desktop or phone by clicking.
Concretely, this means that the AI model powering the agent is given a task like “fill in the rest of this Excel column”, along with screen captures of the GUI. Using this information, it decides which actions to perform on the system: click(x=130, y=540) to open a web browser, type("Value of XYZ in 2025") to enter a query, scroll(down=2) to read more.
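To make that observe-and-act loop concrete, here is a minimal, purely illustrative sketch in Python. The model and env objects are hypothetical stand-ins for a VLM and a desktop/mobile environment; this is not ScreenSuite code.

```python
def run_gui_agent(model, env, task: str, max_steps: int = 10) -> None:
    """Hypothetical observe-act loop for a GUI agent.

    model: a VLM wrapper that proposes the next action given (task, screenshot)
    env:   an environment that can take screenshots and execute GUI actions
    """
    for _ in range(max_steps):
        screenshot = env.screenshot()                 # e.g. a PIL image of the current GUI
        action = model.next_action(task, screenshot)  # e.g. click(x=130, y=540) or type(...)
        if action is None:                            # the model decides the task is done
            break
        env.execute(action)                           # perform the click / type / scroll
```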
A capable GUI agent can navigate a computer just like we do, unlocking virtually any computer task: scrolling Google Maps, editing a file, buying something online. This covers a range of capabilities that are hard to evaluate.
Introducing ScreenSuite 🥳
The literature, e.g. Xu et al. (2025) or Qin et al. (2025), generally divides the capabilities of GUI agents into several categories:
- Perception: correctly perceiving the information displayed on screen.
- Grounding: understanding the position of elements on screen, which is paramount for clicking in the right place.
- Single-step actions: correctly resolving an instruction in a single action.
- Multi-step agents: chaining observations and actions over several steps to complete a full task.
So our first contribution is to gather and unify a comprehensive suite of 13 benchmarks covering the entire range of these GUI agent capabilities.
For the last category above, evaluating multi-step agentic capabilities is especially hard, since it requires spinning up virtual machines to run the agent environments, be it Windows, Android, or Ubuntu.
Implementation details
The benchmark suite was carefully designed with modularity and consistency in mind, ensuring strong alignment across tasks and environments. Where necessary, especially for the online benchmarks, we leverage smolagents as a framework layer to streamline agent execution and orchestration.
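To give a concrete (if simplified) idea of that framework layer, here is a minimal sketch of a vision-only agent built with smolagents. The click tool is a stub, the model choice is arbitrary, and depending on your smolagents version the model class may have a different name (e.g. HfApiModel in older releases); treat this as an illustrative assumption, not ScreenSuite's actual orchestration code.

```python
from PIL import Image
from smolagents import CodeAgent, InferenceClientModel, tool

@tool
def click(x: int, y: int) -> str:
    """Click at the given screen coordinates (stub: a real setup would forward
    this action to the desktop or Android environment).

    Args:
        x: Horizontal pixel coordinate.
        y: Vertical pixel coordinate.
    """
    return f"Clicked at ({x}, {y})"

# Any vision-capable model works here; Qwen2.5-VL is just one example.
model = InferenceClientModel(model_id="Qwen/Qwen2.5-VL-7B-Instruct")
agent = CodeAgent(tools=[click], model=model)

# The agent only sees a screenshot of the GUI: no accessibility tree, no DOM.
screenshot = Image.open("screenshot.png")
agent.run("Open the web browser from the desktop", images=[screenshot])
```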
To support reproducibility and ease of use, we built custom dockerized containers that allow local deployment of a complete Ubuntu desktop or Android environment.
Unlike many existing GUI benchmarks that rely on accessibility trees or other metadata alongside the visual input, our stack is intentionally vision-only. This may lead to scores that diverge from some established leaderboards, but we believe it makes for a more realistic, and more challenging, setup.
- All agentic frameworks (AndroidWorld, OSWorld, GAIA-Web, Mind2Web) use smolagents and rely on vision only, without an added accessibility tree or DOM (contrary to the evaluation settings reported in other sources).
- Mind2Web (Multimodal) originally used multiple-choice selection over element names, based on the accessibility tree and screenshots; we adapted it to click accuracy within bounding boxes using vision only, which greatly increases task difficulty (see the sketch below).
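To give an idea of what that vision-only adaptation measures, here is a minimal sketch of a click-in-bounding-box check, assuming boxes given as (left, top, right, bottom) in pixels; ScreenSuite's actual scoring code may differ in its details.

```python
def click_hit(pred_x: float, pred_y: float, bbox: tuple[float, float, float, float]) -> bool:
    """Return True if the predicted click lands inside the target element's box."""
    left, top, right, bottom = bbox
    return left <= pred_x <= right and top <= pred_y <= bottom

# Example: the model predicts click(x=130, y=540) for a target element whose
# bounding box is (100, 500, 300, 560) -> the click counts as correct.
assert click_hit(130, 540, (100, 500, 300, 560))
```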
Ranking leading VLMs on ScreenSuite
We evaluated leading VLMs on the benchmark:
- The Qwen2.5-VL series, with models from 3B to 72B. These models are known for their impressive localization capabilities: they know the coordinates of every element in an image, which makes them well suited for GUI agents that need to click accurately.
- UI-TARS-1.5-7B, a strong all-rounder by ByteDance.
- Holo1-7B, the latest model from H Company, which shows highly performant localization for its size.
- GPT-4o
Our scores are generally in line with those reported in various sources, with the caveat that we evaluate vision only (see the implementation details above), which causes some differences.

Note that ScreenSuite does not aim to exactly replicate every benchmark published in the industry: rather, it evaluates models for vision-based GUI agent capabilities. As a result, evaluation can be much harder on benchmarks like Mind2Web, where other setups give the agent a much more information-rich view of the context, such as the DOM or the accessibility tree.
Start your custom evaluations in 30 seconds
Head to the repository.
1. Clone the repository with its submodules: git clone --recurse-submodules git@github.com:huggingface/screensuite.git
2. Install the package: uv sync --extra submodules
3. Run the evaluations for the model of your choice: python run.py (it uses run_benchmarks under the hood).
Multi-step benchmarks require a bare-metal machine to run, since they deploy the desktop/mobile environment emulators (see the README.md).
Next steps 🚀
Consistent and meaningful evaluation lets the community iterate quickly and make progress in this area, as we saw with Eleuther's LM Evaluation Harness, the Open LLM Leaderboard, and the Chatbot Arena.
We hope to see many more capable open models in the coming months, able to perform a wide range of tasks while running locally!
To support this effort: