TimeScope is an open-source benchmark designed to measure how well vision-language models understand long videos. By splicing short "needle" clips into base videos ranging from 1 minute to 8 hours, it evaluates three skills: localized search, information integration, and fine-grained temporal perception. TimeScope reveals that many cutting-edge models still struggle with genuine temporal understanding.
Recent advances in multimodal AI have produced models that claim to understand hour-long videos. The trend mirrors progress in long-context language models, which excel at reasoning over lengthy texts, and vision-language systems have followed suit by advertising context windows capable of handling thousands of frames. These claims deserve closer scrutiny, though. Do these models genuinely understand events unfolding over time, or are they limited to surface-level search and recognition? It is worth asking whether their abilities are overstated.
Text benchmarks such as HELM and RULER have exposed the fragility of long-context claims, showing that models often falter when a task demands more than simple retrieval, such as reasoning or aggregation over long contexts. The video domain, however, is still catching up. The most common test, video needle-in-a-haystack (VideoNIAH), injects static images into videos as "needles," so it effectively measures visual search rather than true temporal understanding. Meanwhile, even top-tier models that advertise large frame capacities are rarely trained beyond ~256 frames per clip, and pushing further reveals sharp drops on benchmarks like Video-MME.
This measurement gap raises a question: what does it really mean for a model to "understand" a long video? To address it, we are excited to introduce TimeScope, a new open-source benchmark hosted on Hugging Face. TimeScope probes the limits of long-video capability by inserting several short (~5-10 second) video clips into base videos ranging from 1 minute to 8 hours. Three distinct task types evaluate not only search but also synthesis, localization, and fine-grained motion analysis, providing a more holistic view of temporal understanding.
Why TimeScope? Motivating better video benchmarks
The promise of long-video AI is transformative. Agents could summarize hours of footage, detect subtle anomalies, and answer complex questions about extended narratives. Integrated into robotics, such models could analyze long-duration operations, adapt in real time, and drive autonomous decision-making. Equally powerful is the vision of a personal assistant that understands your day-to-day life and provides continuous, practical feedback.
In reality, these capabilities are often overstated. A model may claim to handle more than 10,000 frames, but its training data is often capped at 256 frames per clip, so performance degrades on longer inputs. This shows up in evaluations that raise frame-sampling rates for tasks requiring genuine temporal insight.
TimeScope flips the script by testing three pillars of long-video understanding:
- Localized Search: can the model find and answer a question about a specific short segment within a vast video?
- Information Integration: can it gather and order details from multiple points across the timeline?
- Fine-Grained Temporal Perception: can it analyze motion and events inside the needle, which requires dense multi-frame sampling?
Benchmark design
A key idea in TimeScope is to use short video clips as "needles," pushing the model not merely to find the needle but to genuinely digest the entire video. We start with a long base video (such as a documentary, lecture, or ambient footage) and insert one or more hand-curated short video needles (5-10 seconds each) at random positions. These needles carry the critical information needed to solve the task, forcing the model to process the full input rather than rely on shortcuts such as sparse sampling.
Figure 1: Overview of TimeScope's needle-insertion process. A long base video (1 minute to 8 hours) serves as the haystack, and short video needles (~5-10 seconds) are spliced into it. The tasks require retrieving, synthesizing, or analyzing content from these needles embedded at different depths.
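To make the splicing step concrete, here is a minimal sketch of needle insertion using the moviepy 1.x API. This is not the official TimeScope pipeline; the file names, the random-depth policy, and the choice of moviepy are all illustrative assumptions.

```python
# Illustrative sketch only: splice a short "needle" clip into a long base video
# at a random timestamp. Paths and the insertion policy are assumptions.
import random
from moviepy.editor import VideoFileClip, concatenate_videoclips

def insert_needle(base_path: str, needle_path: str, seed: int = 0) -> VideoFileClip:
    """Return the base video with the needle spliced in at a random depth."""
    random.seed(seed)
    base = VideoFileClip(base_path)      # haystack: anywhere from 1 minute to 8 hours
    needle = VideoFileClip(needle_path)  # needle: roughly 5-10 seconds

    depth = random.uniform(0, base.duration)  # where the needle lands in the timeline
    return concatenate_videoclips([base.subclip(0, depth), needle, base.subclip(depth)])

# Hypothetical usage:
# clip = insert_needle("lecture_2h.mp4", "needle_transport.mp4")
# clip.write_videofile("haystack_with_needle.mp4")
```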
We evaluate across three needle types, each targeting a different aspect of long-range video understanding.
1. Localized Search
This tests basic retrieval and understanding of a localized event. The question is designed so that sampling the relevant frames from the needle is sufficient to answer it.

Example:
What modes of transport are shown in the video?
2. Information Integration
Here we embed multiple text-based needles (e.g., 2-4 short clips displaying "secret words" as on-screen text) at different points in the video. The model must identify all of the words and report them in chronological order. This simulates tasks like extracting timestamps or pulling key facts from scattered scenes, and it requires scanning the full timeline and understanding relative ordering.
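As an illustration of how this task could be scored, here is a small sketch that accepts an answer only when every secret word is recovered in the correct chronological order. The exact-match criterion, function name, and example words are assumptions, not necessarily TimeScope's official metric.

```python
# Hedged sketch of a possible scoring rule for the information-integration task:
# credit is given only if all secret words are reported in chronological order.
def score_ordered_words(prediction: list[str], ground_truth: list[str]) -> float:
    normalize = lambda words: [w.strip().lower() for w in words]
    return float(normalize(prediction) == normalize(ground_truth))

# Hypothetical example with three text needles shown in this order:
truth = ["otter", "canyon", "velvet"]
print(score_ordered_words(["Otter", "Canyon", "Velvet"], truth))  # 1.0: right words, right order
print(score_ordered_words(["canyon", "otter", "velvet"], truth))  # 0.0: wrong order
```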
3. Fine-Grained Temporal Perception
For questions that focus on motion or a sequence of events within a short clip, sampling a single frame won't cut it. The model must perceive dynamics across consecutive frames. This probes whether long-context processing preserves temporal fidelity.

Example:
How many times did the man swing his x? (a) 1 (b) 2 (c) 3 (d) 4 (e) 5 (f) 6
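To see why sparse sampling fails here, consider how many uniformly sampled frames actually land inside a short needle. The sketch below is a back-of-the-envelope illustration; the durations and frame budget are assumptions, not TimeScope's sampling configuration.

```python
# Back-of-the-envelope: how many uniformly sampled frames land inside the needle?
def frames_inside_needle(video_s: float, needle_start_s: float,
                         needle_len_s: float, num_frames: int) -> int:
    timestamps = (i * video_s / num_frames for i in range(num_frames))
    return sum(needle_start_s <= t < needle_start_s + needle_len_s for t in timestamps)

# A 10-second needle placed midway through a one-hour video, 256 sampled frames:
print(frames_inside_needle(3600, 1800, 10, 256))  # at most one frame of the action,
                                                  # so counting repeated swings is impossible
```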
By varying video lengths and needle placements, TimeScope measures how much video a model can actually process, revealing how performance degrades as videos get longer.
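The resulting accuracy-versus-length curve can be computed by grouping per-sample correctness by base-video duration, roughly as in the sketch below. The field names and numbers are hypothetical bookkeeping choices, not TimeScope's actual result format.

```python
# Sketch: aggregate per-sample correctness into an accuracy-vs-video-length curve.
from collections import defaultdict

def accuracy_by_length(results):
    """results: iterable of dicts like {'video_minutes': 60, 'correct': True}."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["video_minutes"]].append(r["correct"])
    return {minutes: sum(flags) / len(flags) for minutes, flags in sorted(buckets.items())}

# Hypothetical output: {1: 0.95, 10: 0.88, 60: 0.71, 480: 0.42}
# i.e. accuracy drifts downward as the haystack grows.
```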
Evaluations and Leaderboard
To get things started, we ran TimeScope on a range of leading vision-language models, from open-source favorites to juggernauts like Gemini 2.5 Pro. The results underscore the benchmark's value: even models that claim strong long-video support still struggle with genuinely long video tasks. The findings reveal clear patterns, such as performance cliffs at particular durations and contrasting strengths in static retrieval versus motion analysis, and they point the way toward targeted improvements in model training. For detailed results and visualizations, see the embedded Hugging Face Space above.
What did we learn?
Model size isn't everything. Qwen2.5-VL 3B and 7B, as well as the InternVL 2.5 models at 2B, 4B, and 8B parameters, exhibit long-video performance curves that are nearly indistinguishable from their smaller counterparts. They all plateau at roughly the same context length, showing that scaling parameters alone does not automatically grant a longer temporal horizon.
Gemini 2.5 Pro is in a league of its own. It is the only model we tested that maintains strong accuracy on videos over an hour long.
Trade-offs across tasks matter. Qwen2.5-VL shines on the information-integration (OCR) task, identifying and ordering text snippets scattered across the video.
Conclusion: Raising the bar for long-video AI
TimeScope shows that "hour-long video understanding" is still more slogan than reality. By revealing where even cutting-edge models stumble on temporal reasoning, information integration, and motion recognition, the benchmark invites us to rethink how we train and evaluate multimodal systems.
- Run the demo: explore the public Space at https://huggingface.co/spaces/apollo-lmms/timescope
- Benchmark locally: evaluate any model with two quick commands (a hedged data-loading sketch follows below).
- Leaderboard: submit your scores and see how the models compare.
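For local runs, the benchmark data should be loadable with the `datasets` library along the lines of the sketch below. The dataset repository id is an assumption inferred from the Space's organization name, so verify it before use; this is a sketch, not the official command.

```python
# Hedged sketch: pull the benchmark data from the Hugging Face Hub.
from datasets import load_dataset

# NOTE: the dataset id and split are guesses; check the TimeScope Space for the real ones.
ds = load_dataset("Apollo-LMMs/TimeScope", split="test")
print(ds.column_names)  # inspect the available fields
```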
We hope this benchmark helps the community make steady, measurable progress toward models that truly understand video over time.
We are open-sourcing all components of TimeScope.