The most comprehensive evaluation suite for GUI agents!

By versatileai | June 7, 2025

TL;DR

Over the past few weeks, we have been working tirelessly to make GUI agents more open, accessible, and integrated. Along the way, we built the largest benchmark suite for GUI agent performance to date. Let us introduce ScreenSuite.

Today we are very excited to share it with you: ScreenSuite is the most comprehensive and easiest way to evaluate Vision Language Models (VLMs) across a wide range of agentic capabilities!

WTF is a GUI agent?

A GUI agent in action (image courtesy of OSWorld)

In short, AI agents are automated systems that act in a virtual world (a more thorough definition can be found here).

In particular, a "GUI agent" is an agent that lives in a GUI. Think of Claude Computer Use: an agent that can click its way around your desktop or phone.

In essence, this means that the VLM powering the GUI agent is given a task such as "fill in the rest of this Excel column" along with screen captures of the GUI. It uses this information to decide on actions to perform on the system: click(x=130, y=540) to open a web browser, type("XYZ value for 2025") to fill in a field, or scroll(down=2) to read more.
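
To make this concrete, the snippet below sketches a single step of such a loop in Python. Everything here (the Click/Type/Scroll action types and the query_vlm placeholder) is a hypothetical illustration of the idea, not ScreenSuite's or any agent framework's actual API.

    from dataclasses import dataclass

    # Hypothetical action types a vision-only GUI agent could emit.
    @dataclass
    class Click:
        x: int
        y: int

    @dataclass
    class Type:
        text: str

    @dataclass
    class Scroll:
        down: int

    def query_vlm(task: str, screenshot: bytes):
        """Placeholder for the VLM call: given the task and a screenshot,
        return the next action. A real agent would call a model here."""
        return Click(x=130, y=540)  # e.g. open the web browser

    def agent_step(task: str, screenshot: bytes):
        # The model only sees the task and the raw pixels of the screen.
        action = query_vlm(task, screenshot)
        # The chosen action is then executed on the real system
        # (mouse and keyboard events); here we just print it.
        print(f"Executing {action}")
        return action

    agent_step("Fill in the rest of this Excel column", screenshot=b"")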

A good GUI agent could navigate a computer the same way we do, thus unlocking all computer-based tasks: scrolling Google Maps, editing a file, buying items online. This covers a wide variety of capabilities that are difficult to evaluate.

Introducing ScreenSuite 🥳

The literature, e.g. Xu et al. (2025) or Qin et al. (2025), generally divides the capabilities of GUI agents into several categories:

  • Perception: correctly perceiving the information displayed on screen
  • Grounding: understanding where an element is located; this is paramount for clicking in the correct place
  • Single-step actions: correctly resolving an instruction into a single action
  • Multi-step agentic capabilities: solving a task through a sequence of actions
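
For reference, this taxonomy can be written down as a small data structure; the enum below simply restates the four categories above and is not taken from ScreenSuite's code.

    from enum import Enum

    # The four GUI-agent capability categories described above.
    class Capability(Enum):
        PERCEPTION = "perceive the information displayed on screen"
        GROUNDING = "locate an element to click in the correct place"
        SINGLE_STEP_ACTION = "resolve an instruction into a single action"
        MULTI_STEP_AGENTIC = "solve a task through a sequence of actions"

    for capability in Capability:
        print(f"{capability.name}: {capability.value}")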

Our first contribution is therefore to collect and unify a comprehensive suite of 13 benchmarks spanning the entire range of these GUI agent capabilities.

Looking at the last category above, evaluating multi-step agentic capabilities is particularly difficult, as it requires virtual machines to run the agent's environment, be it Windows, Android, or Ubuntu.

Implementation details

The benchmark suite was carefully designed with modularity and consistency in mind, to ensure strong alignment across tasks and environments. Where needed, particularly for online benchmarks, it leverages smolagents as a framework layer to streamline agent execution and orchestration.
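
As a flavor of what such a framework layer looks like, here is a minimal smolagents sketch. It assumes a recent smolagents release (CodeAgent plus InferenceClientModel) and an arbitrary model id chosen for illustration; ScreenSuite's actual orchestration is more involved than this.

    from smolagents import CodeAgent, InferenceClientModel

    # Assumed model id for illustration; any VLM served via the
    # Hugging Face Inference API could be plugged in here.
    model = InferenceClientModel(model_id="Qwen/Qwen2.5-VL-7B-Instruct")

    # A bare agent with no extra tools, just to show the orchestration API.
    agent = CodeAgent(tools=[], model=model)

    # The agent plans and executes steps until it considers the task solved.
    result = agent.run("Open the calculator app and compute 17 * 23")
    print(result)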

To support reproducibility and ease of use, we have built custom dockerized containers that allow a complete Ubuntu desktop or Android environment to be deployed locally.

Unlike many existing GUI benchmarks that rely on accessibility trees or other metadata alongside visual input, our stack is intentionally vision-only. This can lead to scores that differ from some established leaderboards, but we think it creates a more realistic and challenging setup.

  • All agentic benchmarks (AndroidWorld, OSWorld, GAIA-Web, Mind2Web) use smolagents and rely solely on vision, without adding an accessibility tree or DOM (as opposed to the evaluation settings reported in other sources).
  • Mind2Web (multimodal) originally used multiple-choice selection over element names, based on accessibility trees and screenshots; we adapted it to measure click accuracy within bounding boxes using vision only, which greatly increases the difficulty of the task (see the sketch just below).
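
To illustrate that last change, click accuracy within a bounding box boils down to a simple geometric check; the following function is a generic sketch of the metric, not ScreenSuite's exact implementation.

    def click_in_box(click_x: float, click_y: float,
                     box: tuple[float, float, float, float]) -> bool:
        """Return True if a predicted click falls inside the target element's
        bounding box, given as (left, top, right, bottom) in pixels."""
        left, top, right, bottom = box
        return left <= click_x <= right and top <= click_y <= bottom

    # Example: a click predicted at (132, 547) for an element whose
    # ground-truth box spans x in [100, 200] and y in [520, 560].
    print(click_in_box(132, 547, (100, 520, 200, 560)))  # True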

Ranking leading VLMs on ScreenSuite

We evaluated leading VLMs on the benchmark:

  • The Qwen-2.5-VL series, with models ranging from 3B to 72B. These models are known for their impressive localization abilities: they know the coordinates of every element in an image, which makes them well suited to GUI agents that need to click accurately.
  • UI-TARS-1.5-7B, an all-rounder by ByteDance.
  • Holo1-7B, the latest model from H Company, which shows highly performant localization for its size.
  • GPT-4o.

Our scores generally match those reported in a variety of sources! The caveat, as noted in the implementation details above, is that we evaluate vision only, which causes some differences.

Please note that ScreenSuite does not aim to exactly replicate benchmark results published elsewhere: it evaluates models on vision-based GUI agent capabilities. As a result, the evaluation is much harder on benchmarks like Mind2Web, where other setups provide the agent with an information-rich view of the context, such as the DOM or accessibility tree.

Start your own evaluations in 30 seconds

Head to the repository.

  • Clone the repository with its submodules:
    git clone --recurse-submodules git@github.com:huggingface/screensuite.git
  • Install the package:
    uv sync --extra submodules
  • Run python run.py, swapping in the model and benchmarks you want to evaluate in run_benchmarks.

The multi-step benchmarks require a bare-metal machine to run, since they deploy the desktop/mobile environment emulators (see the README.md).

Next steps 🚀

Consistent and meaningful evaluation allows the community to iterate quickly and make progress in this area, as we saw with the Eleuther LM Evaluation Harness, the Open LLM Leaderboard, and the Chatbot Arena.

We hope to see much more capable open models in the coming months, models that can reliably perform a wide range of tasks while running locally!

To support this effort, head over to the repository and give ScreenSuite a try.
