Rapid advances in large language models (LLMs) have driven significant progress on established academic and industry benchmarks. Knowledge benchmarks such as MMLU and GPQA are nearing saturation, and frontier models have made great strides on expert-level evaluations such as HLE. However, this success on static, knowledge-based tasks does not always translate into effectiveness in dynamic, interactive settings, which are the environments in which effective assistants and AI agents must operate. Developing robust methodologies for evaluating LLMs as autonomous agents in complex, exploratory environments remains an important challenge.
Two main avenues exist for evaluating autonomous agents: using real environments that test narrow, specific skills such as tool use and coding, or using simulated open-world environments. The latter better captures an agent's ability to operate autonomously in exploratory settings that require sustained, self-directed reasoning over a long, growing context, while remaining easy to evaluate. This direction is still developing, but interest has been growing through benchmarks such as BALROG and ARC-AGI and demonstrations of models such as Claude and Gemini playing Pokémon. Building on this emerging line of work, we introduce TextQuests.

TextQuests
TextQuests is a benchmark built on 25 classic interactive fiction games. These once-popular text-based video games can take human players over 30 hours to complete and require hundreds of precise actions to solve, providing a compelling testbed for agent reasoning challenges. They require an agent to demonstrate:
Long-Context Reasoning: Agents must devise and execute multi-step plans by reasoning over a long, continuously growing history of actions and observations, relying solely on their intrinsic capabilities without the aid of external tools.
Learning through Exploration: Agents need to learn from experience within a game session, interrogating their own failures and making iterative improvements through trial and error as they explore an unknown world.
Success in these games requires an agent to build understanding over a lengthy gameplay session, allowing for a more direct and accurate evaluation of the LLM itself as the reasoning backbone of an AI agent system.
Evaluation
For each model, we conduct two evaluation runs: one with access to the official game hints and clues (With Clues) and one without (No Clues). Each run lasts at most 500 steps and stops early once the agent successfully completes the game. To handle the growing context, the full game history is maintained without truncation throughout the run. This long-context evaluation is computationally feasible thanks to prefix caching in modern LLM inference frameworks. We use two main evaluation metrics (a rough sketch of the overall loop follows the metric definitions below):
Game Progress. The game progress metric is computed against a set of labeled checkpoints that represent required objectives on the path to completing each game.
Harm. To assess an agent's ethical behavior, we measure harm by tracking specific in-game actions that are considered harmful to some degree. This score is averaged across all games and reflects an agent's overall tendency to perform such actions.
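As a rough illustration of this setup, the sketch below shows how a single evaluation run and its progress and harm tallies could be structured. The agent, game, and checkpoint interfaces here are hypothetical stand-ins, not the actual TextQuests harness API.

```python
# Minimal sketch of one evaluation run (hypothetical interfaces, for illustration only).
MAX_STEPS = 500  # each run is capped at 500 steps

def evaluate_game(agent, game, checkpoints, harmful_actions, with_clues=False):
    """Run one agent on one game, keeping the full, untruncated history."""
    history = []                                   # complete (action, observation) log
    observation = game.reset(clues=with_clues)
    reached = set()
    harm_count = 0

    for step in range(MAX_STEPS):
        # The agent conditions on the entire history; prefix caching in the
        # inference framework keeps this long-context loop affordable.
        action = agent.act(history, observation)
        observation = game.step(action)
        history.append((action, observation))

        if action in harmful_actions:              # track flagged harmful actions
            harm_count += 1
        for name, is_reached in checkpoints.items():
            if is_reached(history):                # labeled checkpoint predicates
                reached.add(name)

        if game.completed:                         # stop early on success
            break

    return {
        "progress": len(reached) / len(checkpoints),  # fraction of checkpoints hit
        "harm": harm_count,
        "steps": step + 1,
    }
```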

Discussion
Long-context reasoning. During evaluation, the context window can exceed 100K tokens, and LLMs must consistently perform accurate reasoning and planning over a vast history of observations and clues to make progress. As the context length grows, we find that current models often hallucinate about their previous interactions, for example believing they have already picked up an item when they have not, or that they are navigating in a loop. Furthermore, similar to observations of Gemini 2.5 playing Pokémon, LLM agents show an increased tendency to repeat actions from their history rather than synthesizing new plans as the context grows. These long-context failures are particularly pronounced on tasks that require spatial reasoning. For example, in Wishbringer, most LLMs struggled to climb back down to the bottom of the cliff after ascending it. The solution simply requires reversing the sequence of directions used to climb up, information readily available in the context history, indicating fundamental difficulties in building and using mental maps. Similarly, all frontier LLMs struggle to navigate the infamous maze in Zork I.
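To make the Wishbringer example concrete, the missing reasoning step is simply to take the ascent path already present in the context, reverse it, and invert each direction. A minimal sketch (the direction vocabulary is illustrative):

```python
# Invert a recorded ascent path to obtain the descent path.
OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "up": "down", "down": "up",
}

def descent_path(ascent_path):
    """Reverse the ascent and flip each step, using only information already in the history."""
    return [OPPOSITE[step] for step in reversed(ascent_path)]

print(descent_path(["up", "north", "east", "up"]))
# ['down', 'west', 'south', 'down']
```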

Examples of long-context reasoning failures in TextQuests. Left: In Zork I, we frequently observed that LLMs were unable to properly recall information from their history, dropping the matchbook in the Studio rather than the Atlantis Room. Right: In Wishbringer, LLMs often fail to retrieve their ascent path from the context history and reverse it to navigate back down the cliff.
Dynamic thinking. The overall effectiveness of an agent is defined by both task success and operational efficiency. For LLM agents, efficiency is closely tied to the number of output or reasoning tokens generated, which directly affects inference cost and latency. Models that use more test-time compute generally achieve better performance; however, this trend begins to plateau beyond a certain budget. This consideration matters because many exploratory steps in TextQuests (e.g., navigation) are intermediate and can be executed successfully without significant reasoning depth.
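One simple way to express the efficiency side of this trade-off (an illustrative formulation, not a metric defined by the benchmark) is progress gained per thousand generated tokens:

```python
def token_efficiency(progress: float, output_tokens: int) -> float:
    """Illustrative metric: game progress per 1K output/reasoning tokens."""
    return progress / (output_tokens / 1000.0)

# Placeholder values only: 40% progress while generating 2M tokens.
print(token_efficiency(0.40, 2_000_000))  # 0.0002
```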

Comparison of performance versus output/reasoning token efficiency across frontier LLMs on TextQuests. Because many exploratory steps are intermediate and do not require the full reasoning budget, an ideal LLM agent should dynamically calibrate its reasoning effort for efficiency while maintaining consistent performance.
In summary, TextQuests assesses how well a model can make progress through a series of classic interactive fiction games that were once popular among human players. We hope that open-sourcing TextQuests will help researchers better understand and evaluate the current capabilities of LLM agents in challenging exploratory environments. Open-source model builders can submit to the TextQuests leaderboard by emailing agibenchmark@safe.ai.
Citation
@misc{phan2025textquestsgoodllmstextbased,
      title={TextQuests: How Good are LLMs at Text-Based Video Games?},
      author={Long Phan and Andy Zou and Dan Hendrycks},
      year={2025},
      eprint={2507.23701},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.23701},
}