
How good are LLMs at text-based video games?

By versatileai · August 13, 2025 · 5 min read




The rapid advance of large language models (LLMs) has driven significant progress on established academic and industrial benchmarks. Knowledge benchmarks such as MMLU and GPQA are now near saturation, and frontier models have made great strides on expert-level evaluations such as HLE. However, this success on static, knowledge-based tasks does not always translate into effectiveness in dynamic, interactive settings, the environments in which capable assistants and AI agents must actually operate. Developing robust methodologies for assessing LLMs as autonomous agents in complex, exploratory environments remains an important challenge.

Two main avenues exist for evaluating autonomous agents: real environments that test limited, specific skills such as tool use and coding, or simulated open-world environments. The latter better captures an agent's ability to operate autonomously in exploratory settings that demand sustained, self-directed reasoning over a long and growing context, while remaining easier to evaluate. This direction is still maturing, but interest has grown through benchmarks such as BALROG and ARC-AGI and through demonstrations of models such as Claude and Gemini playing Pokémon. Building on this emerging line of work, this article introduces TextQuests.


TextQuests

TextQuests is a benchmark built on 25 classic interactive fiction games. These once-popular text-based video games can take human players more than 30 hours to finish and require hundreds of precise actions to solve, making them an attractive testbed for agent reasoning. They require the agent to demonstrate:

Long-context reasoning: agents must devise and execute multi-step plans by reasoning over a long, continuously growing history of actions and observations, relying solely on their intrinsic capabilities without help from external tools.

Learning through exploration: within a game, agents must learn from experience, question their own mistakes, and improve iteratively through trial and error as they explore an unknown world.

Success in these games requires an agent to build understanding over a lengthy gameplay session, which allows for a more direct and accurate evaluation of the LLM itself as the reasoning backbone of an AI agent system.
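To make the setup concrete, the following is a minimal sketch of what such an agent looks like when the LLM is the sole reasoning backbone: each turn, the model sees the entire history of observations and its own past commands and replies with the next game command, with no external tools. The message format and the chat_completion client are illustrative assumptions, not the benchmark's actual harness.

# Minimal sketch of an LLM-only interactive fiction agent (illustrative, not
# the TextQuests harness). `chat_completion` stands in for any chat-style LLM
# client that maps a list of messages to a text reply.
def next_command(chat_completion, history, observation):
    """Ask the LLM for the next one-line game command given the full history."""
    messages = [{
        "role": "system",
        "content": "You are playing an interactive fiction game. "
                   "Reply with a single game command.",
    }]
    messages += history                                   # full, untruncated history
    messages.append({"role": "user", "content": observation})

    command = chat_completion(messages).strip()

    # Record this turn so that later steps can reason over it.
    history.append({"role": "user", "content": observation})
    history.append({"role": "assistant", "content": command})
    return command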

Figure: an example of the diverse reasoning challenges in TextQuests.


Evaluation

For each model, we conduct two evaluation runs: one with access to the official game hints (With Clues) and one without (No Clues). Each run lasts at most 500 steps and stops early once the agent completes the game. To handle the growing context, the full game history is kept without truncation throughout the run; this long-context evaluation is computationally feasible thanks to prompt caching in modern LLM inference frameworks. We use two main evaluation metrics:

Game Progress: computed from a set of labeled checkpoints that represent the objectives required to complete each game.

Harm: to assess an agent's ethical behavior, we track specific in-game actions considered harmful. The score is averaged across all games and reflects the agent's overall tendency to take such actions.
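Put together, the protocol above amounts to a short harness. The sketch below keeps only what the description states (a 500-step cap, early stopping on completion, an untruncated history, and the two metrics); the environment wrapper, checkpoint sets, and harm counters are illustrative assumptions.

MAX_STEPS = 500  # per-run cap stated above; everything environment-related is assumed

def run_game(env, agent, with_clues):
    """One evaluation run: keep the full history, stop early on completion."""
    history = []                                  # never truncated during the run
    observation = env.reset(clues=with_clues)     # "With Clues" vs. "No Clues" run
    for _ in range(MAX_STEPS):
        action = agent(history, observation)      # the LLM picks the next command
        history.append((observation, action))
        observation, done = env.step(action)
        if done:                                  # game completed: stop early
            break
    return env.reached_checkpoints(), env.harmful_action_count()

def game_progress(reached, required):
    """Fraction of labeled checkpoints on the completion path that were reached."""
    return len(set(reached) & set(required)) / len(required)

def harm(per_game_counts):
    """Harmful in-game actions, averaged across all games."""
    return sum(per_game_counts) / len(per_game_counts)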

Figure: LLM performance on TextQuests.

Results

Discussion

Long-context reasoning. During evaluation, the context can exceed 100K tokens, and LLMs must consistently reason and plan accurately over a vast history of observations and hints to make progress. As the context grows, we find that current models often hallucinate about their previous interactions, for example believing they have already picked up an item when they have not, or believing they are navigating in a loop. Furthermore, echoing observations of Gemini 2.5 playing Pokémon, LLM agents become increasingly inclined to repeat actions from their history rather than synthesize new plans as the context grows. These long-context failures are especially pronounced on tasks that require spatial reasoning. For example, in Wishbringer, most LLMs had a hard time getting back down a cliff after climbing it, even though the solution simply requires reversing the sequence of directions used on the way up (information available in the context history), which points to a fundamental difficulty in building and using mental maps. Similarly, all frontier LLMs struggle to navigate the infamous maze in Zork I.
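The Wishbringer failure is notable because the required computation is trivial once the ascent is retrieved from the history: reverse the recorded moves and invert each one. The toy illustration below shows that inversion; the compass vocabulary is standard interactive-fiction usage, and the example ascent path is hypothetical.

# Toy illustration of the Wishbringer descent: reverse the recorded ascent and
# invert each step. The example ascent path is hypothetical.
OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "northeast": "southwest", "southwest": "northeast",
    "northwest": "southeast", "southeast": "northwest",
    "up": "down", "down": "up",
}

def descent_path(ascent):
    """Walk the ascent backwards, inverting each move."""
    return [OPPOSITE[step] for step in reversed(ascent)]

print(descent_path(["up", "northeast", "up"]))  # ['down', 'southwest', 'down']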


Figure: examples of long-context reasoning failures in TextQuests. Left: in Zork I, the LLM hallucinates, failing to correctly recall from its history that it dropped the matchbook in the Studio rather than the Atlantis Room. Right: in Wishbringer, LLMs often fail to retrieve their ascent path from the context history and reverse it in order to climb back down the cliff.

Dynamic thinking. An agent's overall effectiveness is defined by both task success and operational efficiency. For LLM agents, efficiency is closely tied to the number of output or reasoning tokens generated, which directly affects inference cost and latency. Models that use more test-time compute generally perform better, but the gains begin to taper off beyond a certain budget. This matters because many exploratory steps in TextQuests (e.g., navigation) are intermediate steps that can be completed successfully without deep reasoning.
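One way to act on this observation is to vary the reasoning budget per step rather than spending the full budget everywhere. The step categories and token numbers in the sketch below are illustrative assumptions, not something the benchmark prescribes.

# Toy heuristic for dynamic reasoning effort (all thresholds are illustrative).
def reasoning_budget(step_kind, recent_failures):
    """Pick a per-step cap on reasoning tokens."""
    if recent_failures >= 3:         # repeatedly stuck: spend more test-time compute
        return 8192
    if step_kind == "navigation":    # routine intermediate step: think briefly
        return 256
    return 2048                      # default for puzzle-like or novel steps

print(reasoning_budget("navigation", recent_failures=0))  # 256
print(reasoning_budget("puzzle", recent_failures=4))      # 8192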


Figure: comparison of output and reasoning-token efficiency across frontier LLMs on TextQuests. Because many exploratory steps are intermediate and do not require the full reasoning budget, an ideal LLM agent should dynamically modulate its reasoning effort while maintaining consistent performance.

In sum, TextQuests assesses how well a model can progress through a series of classic interactive fiction games that were once popular among human players. We hope that open-sourcing TextQuests will help researchers better understand and evaluate the current capabilities of LLM agents in challenging exploratory environments. Open-source model builders can submit to the TextQuests leaderboard by emailing agibenchmark@safe.ai.

Citation

@misc{phan2025textquestsgoodllmstextbased,
  title={TextQuests: How Good are LLMs at Text-Based Video Games?},
  author={Long Phan and Andy Zou and Dan Hendrycks},
  year={2025},
  eprint={2507.23701},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.23701},
}
