In this blog post, we share our journey towards releasing CinePile 2.0, a much-improved version of our long-video question answering dataset. The improvements in this release center on a new approach: adversarial dataset refinement.
We are pleased to share both CinePile 2.0 and our implementation of the adversarial refinement method. We believe this can enhance many existing datasets and be directly incorporated as part of future dataset creation pipelines.
If you are primarily interested in adversarial refinement methods, you can jump directly to the “Adversarial Refinement” section.
Wait, what is CinePile?
In May 2024, we launched CinePile, a long video QA dataset with approximately 300,000 training samples and 5,000 test samples.
The first release stood out from other datasets in two ways:
- Question variety: the questions cover temporal understanding, plot analysis, character dynamics, setting, and theme.
- Question difficulty: in our benchmarks, humans outperformed the best commercial vision models by 25% and the best open-source vision models by 65%.
Take a look at a sample from the dataset below.
Part of the secret sauce behind it is that it relies on YouTube movie clips and on Q&A pairs drawn from the accurate audio descriptions designed for visually impaired viewers. These descriptions provide rich context beyond basic visual cues (e.g., “What color is the car?”), which allows us to formulate more complex questions.
Please tell me more. How was the original dataset assembled?
To automate question creation, we first built question templates by inspecting existing datasets such as MovieQA and TVQA. We clustered the questions in these datasets with the text-similarity model WhereIsAI/UAE-Large-V1 and used 10 random examples from each cluster to write a question template and a prototypical question for each category.
| Category | Question template | Typical question |
| --- | --- | --- |
| Character and Relationship Dynamics (CRD) | Interpersonal Dynamics | What changes occur in the relationship between A and B after they share an experience or action? |
| Character and Relationship Dynamics (CRD) | Decision Justification | What reasons did the characters give for making the decision? |
| Narrative and Plot Analysis (NPA) | Crisis Events | What major events led the characters to take drastic action? |
| Narrative and Plot Analysis (NPA) | Mysteries Revealed | What secret does Character A reveal about Event B? |
| Setting and Technical Analysis (STA) | Physical Possessions | What physical possessions does [character name] have? |
| Setting and Technical Analysis (STA) | Environmental Details | What does the [setting/location] look like during [specific time/event]? |
| Temporal (TEMP) | Critical Time-Sensitive Actions | What must [character] do immediately, and what are the consequences of not doing so? |
| Temporal (TEMP) | Frequency | How many times does the character attempt [Action A]? |
| Thematic Exploration (TH) | Symbolism and Motif Tracking | Do any symbols or motifs introduced in Scene A reappear or evolve in Scene B, and what do they signify? |
| Thematic Exploration (TH) | Thematic Parallels | What does the confusion in the scene parallel in terms of the film’s themes? |
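To make the clustering step above concrete, here is a minimal sketch of how questions from existing datasets could be grouped by semantic similarity. It assumes the WhereIsAI/UAE-Large-V1 model can be loaded through sentence-transformers and uses k-means with an arbitrary cluster count; the example questions are placeholders, not actual MovieQA/TVQA entries, and this is not the exact code we ran.

```python
# pip install sentence-transformers scikit-learn
import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative pool of questions; in practice these come from MovieQA / TVQA.
questions = [
    "What does Ross give Rachel at the coffee shop?",
    "Why does Walter decide to keep the money?",
    "How many times does the phone ring before Jim answers?",
    "What is the detective holding when he enters the room?",
    # ... thousands more
]

# Embed every question with the text-similarity model mentioned above.
model = SentenceTransformer("WhereIsAI/UAE-Large-V1")
embeddings = model.encode(questions, normalize_embeddings=True)

# Group semantically similar questions; the cluster count here is a guess.
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto").fit(embeddings)

# Sample up to 10 questions per cluster as raw material for one question template.
for cluster_id in range(n_clusters):
    members = [q for q, label in zip(questions, kmeans.labels_) if label == cluster_id]
    sample = random.sample(members, k=min(10, len(members)))
    print(f"Cluster {cluster_id}: {sample}")
```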
Since not every template is relevant to every movie clip, we used Gemini 1.0 Pro to select the most appropriate templates for each scene. Next, we prompted the question-generation language model with the scene’s text, the selected template names (e.g., “Physical Possessions”), example questions, and a system prompt to create scene-specific questions. A well-designed prompt helps the model focus on the whole scene and generate deeper questions while avoiding superficial ones. We found that:
- Providing prototypical examples and including timestamps in the dialogue and visual descriptions prevents GPT-4 hallucinations, and also leads to more plausible multiple-choice question (MCQ) distractors.
- Asking the model to provide evidence for its answers improves the quality of the questions.
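To illustrate that prompting step, here is a sketch of how a scene-specific question-generation prompt could be assembled. The wording of the prompt, the helper function, and the example inputs are all hypothetical; the exact prompts used for CinePile ship with the released code.

```python
def build_question_prompt(scene_text: str, template_name: str,
                          example_questions: list[str], n_questions: int = 3) -> str:
    """Assemble a question-generation prompt for one scene and one template.

    `scene_text` is expected to contain time-stamped dialogue and visual
    descriptions, which helps keep the generated questions grounded.
    """
    examples = "\n".join(f"- {q}" for q in example_questions)
    return (
        "You are writing multiple-choice questions about a movie scene.\n"
        f"Question template: {template_name}\n"
        f"Example questions for this template:\n{examples}\n\n"
        "Scene (time-stamped dialogue and visual descriptions):\n"
        f"{scene_text}\n\n"
        f"Write {n_questions} questions that require watching the whole scene. "
        "Avoid trivial questions. For each question, give the correct answer, "
        "four plausible distractors, and a short rationale citing evidence "
        "(with timestamps) from the scene."
    )

# Hypothetical example inputs.
prompt = build_question_prompt(
    scene_text="[00:12] INT. GARAGE - Sarah hides the keys under a red toolbox...",
    template_name="Physical Possessions",
    example_questions=["What is the character holding when they enter the room?"],
)
print(prompt)
```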
Using this approach, we generated approximately 32 questions per video. Before releasing CinePile, we implemented several mechanisms to ensure the quality of the dataset and benchmark, which we describe in the next section.
Inspecting the quality of the first release
While our process typically produces well-formed, answerable questions, some questions are trivial or rely on basic concepts that don’t require watching the clip. To address this, we used several large language models (LLMs) to identify and filter out three types of issues:
Degeneracy issues
A question is considered “degenerate” if its answer is obvious from the question itself (e.g., “What color is the pink house?”). These make up only a small portion of the dataset, but manually reviewing every question was not feasible at our scale, so we used three LLMs (Gemini, GPT-3.5, and Phi-1.5) to detect them automatically. If all three models answered a question correctly without any context, the question was removed from the evaluation set.
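Conceptually, the check asks each model to answer using only the question and the answer options, and removes questions that every model gets right. Here is a minimal sketch under that assumption; the models are represented as simple callables standing in for the Gemini, GPT-3.5, and Phi-1.5 calls, and the prompt wording is illustrative.

```python
from typing import Callable, Sequence

AnswerFn = Callable[[str], int]  # takes the prompt, returns the predicted option index

def is_degenerate(question: str, options: Sequence[str], answer_idx: int,
                  models: Sequence[AnswerFn]) -> bool:
    """Flag a question as degenerate if every model answers it correctly from
    the question text and answer options alone (no subtitles, no visuals)."""
    prompt = (
        "Answer this multiple-choice question using only the text below. "
        "Reply with the index of the correct option.\n\n"
        f"Question: {question}\n"
        + "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
    )
    for answer in models:
        if answer(prompt) != answer_idx:
            return False   # at least one model needed context -> keep the question
    return True            # every model answered correctly without context

# Toy usage with dummy "models" that always pick option 0, standing in for
# the Gemini, GPT-3.5, and Phi-1.5 calls:
dummy_models = [lambda prompt: 0] * 3
print(is_degenerate("What color is the pink house?",
                    ["Pink", "Blue", "Green", "Red", "White"], 0, dummy_models))
```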
Visual dependency issues
Some multiple-choice questions do not require visual information and can be answered using only dialogue. We used the Gemini model to determine whether a question could be answered using dialogue alone. Questions were given a binary score: 0 if answerable without visual information, 1 if visual information was required.
Difficulty rating
To assess the difficulty of the questions, we tested whether the model could answer correctly even when given the full context (both visual description and subtitles).
Through continued use of the benchmark by our team and the broader community, we identified several areas for improvement, which led us to work on CinePile 2.0.
CinePile 2.0
For the second release of CinePile, we worked with Hugging Face to identify and prioritize several areas for improvement (following successful experiments fine-tuning Video-LLaVA 7B on CinePile).
CinePile 1.0 issues
The degeneracy filtering in CinePile 1.0 was useful, but it had some limitations:
- Some questions could still be answered using just the question and answer choices, without requiring the transcript or visual content.
- Many of the flagged questions referenced valuable information from the video; instead of discarding them, they could have been reworded to better capture that value.
- Degeneracy checks were limited to the test set: running multiple models (particularly closed-source ones) over the much larger CinePile 1.0 training set was too expensive to scale.
To address these issues, we introduced a new Adversarial Refinement pipeline that improves weak questions instead of simply discarding them, and that is easier to apply at scale. Throughout this post, we refer to the model that identifies degenerate questions (using only the question and answer choices, with no visual or dialogue information) as the “Deaf-Blind LLM.”
Adversarial Refinement
The Adversarial Refinement pipeline aims to modify the question or the answer choices until the Deaf-Blind LLM cannot easily predict the correct answer. Here’s how it works:
1. The Deaf-Blind LLM provides both an answer and a rationale explaining its choice based solely on the question and answer choices. These rationales help identify the implicit cues and biases embedded in a question.
2. The question-generation model uses this rationale to modify the question and/or the answer choices, removing the implicit cues.
3. This process is repeated up to five times per question, until the Deaf-Blind LLM’s accuracy drops to random chance.
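Here is a minimal sketch of that loop. The `blind_solve`, `rewrite`, and `is_degenerate` callables are placeholders for the Deaf-Blind LLM, the question-generation model, and the chance-level check sketched further below; this is an illustration of the idea, not the exact released implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCQ:
    question: str
    options: list[str]
    answer_idx: int

MAX_ROUNDS = 5  # each question gets at most five rewrite attempts

def adversarially_refine(
    mcq: MCQ,
    blind_solve: Callable[[MCQ], tuple[int, str]],  # Deaf-Blind LLM: (predicted index, rationale)
    rewrite: Callable[[MCQ, str], MCQ],             # question-generation model
    is_degenerate: Callable[[MCQ], bool],           # chance-level check, sketched further below
) -> tuple[MCQ, bool]:
    """Rewrite a question until the Deaf-Blind LLM can no longer answer it
    without context, or give up after MAX_ROUNDS attempts."""
    for _ in range(MAX_ROUNDS):
        if not is_degenerate(mcq):
            return mcq, True               # fixed: blind accuracy is back at chance level
        _, rationale = blind_solve(mcq)    # the rationale exposes the implicit cue
        mcq = rewrite(mcq, rationale)      # remove that cue from the question/options
    return mcq, False                      # still degenerate after five rounds
```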
Given the computational demands of this iterative process, we needed a powerful and accessible LLM that could be run locally to avoid API usage limitations, delays, and costs of cloud services. We chose:
- Llama 3.1 70B (open-source model) as the Deaf-Blind LLM
- GPT-4 to generate the question modifications
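For reference, loading an instruction-tuned Llama 3.1 70B checkpoint locally with the transformers library might look like the sketch below; the checkpoint name, dtype, and generation settings are assumptions that depend on your access and hardware (a 70B model needs several high-memory GPUs).

```python
# pip install transformers accelerate
import torch
from transformers import pipeline

# Sharded across available GPUs; assumes access to the gated Meta checkpoint.
deaf_blind_llm = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = (
    "Answer the multiple-choice question below and explain, in one sentence, "
    "which cue in the question points to the answer.\n\n"
    "Question: What color is the pink house?\n"
    "0. Pink\n1. Blue\n2. Green\n3. Red\n4. White"
)
print(deaf_blind_llm(prompt, max_new_tokens=128)[0]["generated_text"])
```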
To account for random chance, we did the following:
- We tested five different permutations of the answer-choice order.
- We marked a question as degenerate if the model answered it correctly in at least 3 out of the 5 trials.
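A minimal sketch of that chance-level check is shown below. It assumes a `blind_predict` callable that returns the option index the Deaf-Blind LLM picks for a given ordering of the choices, and it uses random shuffles of the answer order as a stand-in for the specific permutations described above; the 3-out-of-5 threshold follows the description above.

```python
import random

def is_degenerate(question: str, options: list[str], answer_idx: int,
                  blind_predict, n_trials: int = 5, threshold: int = 3,
                  seed: int = 0) -> bool:
    """Ask the Deaf-Blind LLM the question n_trials times with shuffled answer
    orders; flag it as degenerate if it is right in at least `threshold` trials."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        order = list(range(len(options)))
        rng.shuffle(order)                        # permute the answer choices
        shuffled = [options[i] for i in order]
        new_answer_idx = order.index(answer_idx)  # where the true answer landed
        if blind_predict(question, shuffled) == new_answer_idx:
            correct += 1
    return correct >= threshold
```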
Results of adversarial refinement
In short, this is the effect of running adversarial refinement on CinePile:
- Fixed 90.24% of degenerate Q&A pairs in the test set.
- Manually reviewed the Q&A pairs that could not be fixed automatically (approximately 80 out of 800): fixed them where possible, and otherwise excluded them from the evaluation split.
- Fixed 90.94% of weak pairs in the training set.
- Retained the unfixable training pairs, since they could not be repaired automatically and do not meaningfully harm performance.
Implementation
This release includes both the adversarial refinement pipeline and the code for identifying weak questions. The complete implementation, including all the prompts, is available in the public repository.
Evaluation
After re-evaluating the previously tested models and 16 new Video-LLMs on our modified test set, we highlight the top-performing models in the image below. Here are the results:
- Gemini 1.5 Pro is the leading commercial vision-language model (VLM):
  - It especially excels in “Setting and Technical Analysis”, performing best on visually driven questions about film environments and character interactions.
- GPT-based models showed competitive performance:
  - They were strong in “Narrative and Plot Analysis”, scoring well on questions about story development and character interactions.
- Gemini 1.5 Flash, the lightweight version of Gemini 1.5 Pro:
  - Achieved an overall accuracy of 58.75%, with particularly good results in “Setting and Technical Analysis”.
Open-source models
The open source video LLM community has come a long way from the first release of CinePile to the current release. Here’s what we learned:
Hard split
CinePile’s hard-split results clearly demonstrate that current models still fall far short of human capabilities when it comes to understanding visual narratives and story elements. This gap highlights the value of the new CinePile release as a benchmark for measuring progress toward more sophisticated visual understanding.
Leaderboard
We’ve launched a new CinePile Leaderboard, which will be continually updated as new models come out. Watch this space to learn how to submit your own model for evaluation.