In this blog post, we share our journey towards releasing CinePile 2.0, a much-improved version of our long-video question answering dataset. The improvements in this release center on a new approach: adversarial dataset refinement.
We are pleased to share both CinePile 2.0 and our implementation of the adversarial refinement method. We believe this can enhance many existing datasets and be directly incorporated as part of future dataset creation pipelines.
If you are primarily interested in adversarial refinement methods, you can jump directly to the “Adversarial Refinement” section.
Wait, what is CinePile?
In May 2024, we launched CinePile, a long video QA dataset with approximately 300,000 training samples and 5,000 test samples.
The first release stood out from other datasets in two ways:
- Question variety: the questions cover temporal understanding, plot analysis, character dynamics, setting, and theme.
- Question difficulty: in our benchmarks, humans outperformed the best commercial vision models by 25% and the best open-source vision models by 65%.
Take a look at a sample from the dataset below.
Part of the secret sauce behind it is that it relies on YouTube movie clips and on Q&A pairs drawn from the accurate audio descriptions designed for visually impaired viewers. These descriptions provide rich context beyond basic visual cues (e.g., “What color is the car?”), which allows us to formulate more complex questions.
Please tell me more. How was the original dataset assembled?
To automate question creation, we first built question templates by inspecting existing datasets such as MovieQA and TVQA. We clustered the questions in these datasets with the text-similarity model WhereIsAI/UAE-Large-V1 and used 10 random examples from each cluster to write a question template and a prototypical question for each category.
| Category | Question template | Typical question |
| --- | --- | --- |
| Character and Relationship Dynamics (CRD) | Interpersonal Dynamics | What changes occur in the relationship between A and B after they share an experience or action? |
| Character and Relationship Dynamics (CRD) | Decision Justification | What reasons did the characters give for making the decision? |
| Narrative and Plot Analysis (NPA) | Crisis Events | What major events led the characters to take drastic action? |
| Narrative and Plot Analysis (NPA) | Mysteries Revealed | What secret does Character A reveal about Event B? |
| Setting and Technical Analysis (STA) | Physical Possessions | What physical possessions does [character name] have? |
| Setting and Technical Analysis (STA) | Environmental Details | What does the [setting/location] look like during [specific time/event]? |
| Temporal (TEMP) | Critical Time-Sensitive Actions | What must [character] do immediately, and what are the consequences of not doing so? |
| Temporal (TEMP) | Frequency | How many times does the character attempt [Action A]? |
| Thematic Exploration (TH) | Symbolism and Motif Tracking | Do any symbols or motifs introduced in Scene A reappear or evolve in Scene B, and what do they signify? |
| Thematic Exploration (TH) | Thematic Parallels | What does the confusion in the scene parallel in terms of the film’s themes? |
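To make the clustering step above concrete, here is a minimal sketch of how questions from existing datasets could be grouped by semantic similarity. It assumes the WhereIsAI/UAE-Large-V1 model can be loaded through sentence-transformers and uses k-means with an arbitrary cluster count; the example questions are placeholders, not actual MovieQA/TVQA entries, and this is not the exact code we ran.

```python
# pip install sentence-transformers scikit-learn
import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Illustrative pool of questions; in practice these come from MovieQA / TVQA.
questions = [
    "What does Ross give Rachel at the coffee shop?",
    "Why does Walter decide to keep the money?",
    "How many times does the phone ring before Jim answers?",
    "What is the detective holding when he enters the room?",
    # ... thousands more
]

# Embed every question with the text-similarity model mentioned above.
model = SentenceTransformer("WhereIsAI/UAE-Large-V1")
embeddings = model.encode(questions, normalize_embeddings=True)

# Group semantically similar questions; the cluster count here is a guess.
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto").fit(embeddings)

# Sample up to 10 questions per cluster as raw material for one question template.
for cluster_id in range(n_clusters):
    members = [q for q, label in zip(questions, kmeans.labels_) if label == cluster_id]
    sample = random.sample(members, k=min(10, len(members)))
    print(f"Cluster {cluster_id}: {sample}")
```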
Since not every template is relevant to every movie clip, we used Gemini 1.0 Pro to select the most appropriate templates for each scene. Next, we prompted the question-generation language model with the scene’s text, the selected template names (e.g., “Physical Possessions”), example questions, and a system prompt to create scene-specific questions. A well-designed prompt helps the model focus on the whole scene and generate deeper questions while avoiding superficial ones. We found that:
- Providing prototypical examples and including timestamps in the dialogue and visual descriptions prevents GPT-4 hallucinations, and also leads to more plausible multiple-choice question (MCQ) distractors.
- Asking the model to provide evidence for its answers improves the quality of the questions.
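To illustrate that prompting step, here is a sketch of how a scene-specific question-generation prompt could be assembled. The wording of the prompt, the helper function, and the example inputs are all hypothetical; the exact prompts used for CinePile ship with the released code.

```python
def build_question_prompt(scene_text: str, template_name: str,
                          example_questions: list[str], n_questions: int = 3) -> str:
    """Assemble a question-generation prompt for one scene and one template.

    `scene_text` is expected to contain time-stamped dialogue and visual
    descriptions, which helps keep the generated questions grounded.
    """
    examples = "\n".join(f"- {q}" for q in example_questions)
    return (
        "You are writing multiple-choice questions about a movie scene.\n"
        f"Question template: {template_name}\n"
        f"Example questions for this template:\n{examples}\n\n"
        "Scene (time-stamped dialogue and visual descriptions):\n"
        f"{scene_text}\n\n"
        f"Write {n_questions} questions that require watching the whole scene. "
        "Avoid trivial questions. For each question, give the correct answer, "
        "four plausible distractors, and a short rationale citing evidence "
        "(with timestamps) from the scene."
    )

# Hypothetical example inputs.
prompt = build_question_prompt(
    scene_text="[00:12] INT. GARAGE - Sarah hides the keys under a red toolbox...",
    template_name="Physical Possessions",
    example_questions=["What is the character holding when they enter the room?"],
)
print(prompt)
```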
Using this approach, we generated approximately 32 questions per video. Before releasing CinePile, we implemented several mechanisms to ensure the quality of the dataset and benchmark, which we describe in the next section.
Inspecting the quality of the first release
While our process typically produces well-formed, answerable questions, some questions are trivial or rely on basic concepts that don’t require watching the clip. To address this, we used several large language models (LLMs) to identify and filter out three types of issues:
Degeneracy issues
A question is considered “degenerate” if its answer is obvious from the question itself (e.g., “What color is the pink house?”). These make up only a small portion of the dataset, but manually reviewing every question was not feasible at our scale, so we used three LLMs (Gemini, GPT-3.5, and Phi-1.5) to detect them automatically. If all three models answered a question correctly without any context, the question was removed from the evaluation set.
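Conceptually, the check asks each model to answer using only the question and the answer options, and removes questions that every model gets right. Here is a minimal sketch under that assumption; the models are represented as simple callables standing in for the Gemini, GPT-3.5, and Phi-1.5 calls, and the prompt wording is illustrative.

```python
from typing import Callable, Sequence

AnswerFn = Callable[[str], int]  # takes the prompt, returns the predicted option index

def is_degenerate(question: str, options: Sequence[str], answer_idx: int,
                  models: Sequence[AnswerFn]) -> bool:
    """Flag a question as degenerate if every model answers it correctly from
    the question text and answer options alone (no subtitles, no visuals)."""
    prompt = (
        "Answer this multiple-choice question using only the text below. "
        "Reply with the index of the correct option.\n\n"
        f"Question: {question}\n"
        + "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
    )
    for answer in models:
        if answer(prompt) != answer_idx:
            return False   # at least one model needed context -> keep the question
    return True            # every model answered correctly without context

# Toy usage with dummy "models" that always pick option 0, standing in for
# the Gemini, GPT-3.5, and Phi-1.5 calls:
dummy_models = [lambda prompt: 0] * 3
print(is_degenerate("What color is the pink house?",
                    ["Pink", "Blue", "Green", "Red", "White"], 0, dummy_models))
```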
Visual dependency issues
Some multiple-choice questions do not require visual information and can be answered using only dialogue. We used the Gemini model to determine whether a question could be answered using dialogue alone. Questions were given a binary score: 0 if answerable without visual information, 1 if visual information was required.
Difficulty rating
To assess the difficulty of the questions, we tested whether the model could answer correctly even when given the full context (both visual description and subtitles).
Through continued use of the benchmark by our team and the broader community, we identified several areas for improvement, which led us to work on CinePile 2.0.
CinePile 2.0
For the second release of CinePile, we worked with Hugging Face to identify and prioritize several areas for improvement (following successful experiments fine-tuning Video-LLaVA 7B on CinePile).
CinePile 1.0 issues
The degeneracy filtering in CinePile 1.0 was useful, but it had some limitations:
- Some questions could still be answered using just the question and answer choices, without requiring the transcript or visual content.
- Many of the flagged questions referenced valuable information from the video; instead of discarding them, they could have been reworded to better capture that value.
- Degeneracy checks were limited to the test set: running multiple models (particularly closed-source ones) over the much larger CinePile 1.0 training set was too expensive to scale.
To address these issues, we introduced a new Adversarial Refinement pipeline that improves weak questions instead of simply discarding them, and that is easier to apply at scale. Throughout this post, we refer to the model that identifies degenerate questions (using only the question and answer choices, with no visual or dialogue information) as the “Deaf-Blind LLM.”
Adversarial Refinement
The Adversarial Refinement pipeline aims to modify the question or the answer choices until the Deaf-Blind LLM cannot easily predict the correct answer. Here’s how it works:
1. The Deaf-Blind LLM provides both an answer and a rationale explaining its choice based solely on the question and answer choices. These rationales help identify the implicit cues and biases embedded in a question.
2. The question-generation model uses this rationale to modify the question and/or the answer choices, removing the implicit cues.
3. This process is repeated up to five times per question, until the Deaf-Blind LLM’s accuracy drops to random chance.
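Here is a minimal sketch of that loop. The `blind_solve`, `rewrite`, and `is_degenerate` callables are placeholders for the Deaf-Blind LLM, the question-generation model, and the chance-level check sketched further below; this is an illustration of the idea, not the exact released implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCQ:
    question: str
    options: list[str]
    answer_idx: int

MAX_ROUNDS = 5  # each question gets at most five rewrite attempts

def adversarially_refine(
    mcq: MCQ,
    blind_solve: Callable[[MCQ], tuple[int, str]],  # Deaf-Blind LLM: (predicted index, rationale)
    rewrite: Callable[[MCQ, str], MCQ],             # question-generation model
    is_degenerate: Callable[[MCQ], bool],           # chance-level check, sketched further below
) -> tuple[MCQ, bool]:
    """Rewrite a question until the Deaf-Blind LLM can no longer answer it
    without context, or give up after MAX_ROUNDS attempts."""
    for _ in range(MAX_ROUNDS):
        if not is_degenerate(mcq):
            return mcq, True               # fixed: blind accuracy is back at chance level
        _, rationale = blind_solve(mcq)    # the rationale exposes the implicit cue
        mcq = rewrite(mcq, rationale)      # remove that cue from the question/options
    return mcq, False                      # still degenerate after five rounds
```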
Given the computational demands of this iterative process, we needed a powerful and accessible LLM that could be run locally to avoid API usage limitations, delays, and costs of cloud services. We chose:
- Llama 3.1 70B (open-source model) as the Deaf-Blind LLM
- GPT-4 to generate the question modifications
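For reference, loading an instruction-tuned Llama 3.1 70B checkpoint locally with the transformers library might look like the sketch below; the checkpoint name, dtype, and generation settings are assumptions that depend on your access and hardware (a 70B model needs several high-memory GPUs).

```python
# pip install transformers accelerate
import torch
from transformers import pipeline

# Sharded across available GPUs; assumes access to the gated Meta checkpoint.
deaf_blind_llm = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = (
    "Answer the multiple-choice question below and explain, in one sentence, "
    "which cue in the question points to the answer.\n\n"
    "Question: What color is the pink house?\n"
    "0. Pink\n1. Blue\n2. Green\n3. Red\n4. White"
)
print(deaf_blind_llm(prompt, max_new_tokens=128)[0]["generated_text"])
```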
To account for random chance, we did the following:
- We tested five different permutations of the answer-choice order.
- We marked a question as degenerate if the model answered it correctly in at least 3 out of the 5 trials.
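A minimal sketch of that chance-level check is shown below. It assumes a `blind_predict` callable that returns the option index the Deaf-Blind LLM picks for a given ordering of the choices, and it uses random shuffles of the answer order as a stand-in for the specific permutations described above; the 3-out-of-5 threshold follows the description above.

```python
import random

def is_degenerate(question: str, options: list[str], answer_idx: int,
                  blind_predict, n_trials: int = 5, threshold: int = 3,
                  seed: int = 0) -> bool:
    """Ask the Deaf-Blind LLM the question n_trials times with shuffled answer
    orders; flag it as degenerate if it is right in at least `threshold` trials."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        order = list(range(len(options)))
        rng.shuffle(order)                        # permute the answer choices
        shuffled = [options[i] for i in order]
        new_answer_idx = order.index(answer_idx)  # where the true answer landed
        if blind_predict(question, shuffled) == new_answer_idx:
            correct += 1
    return correct >= threshold
```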
Results of adversarial refinement
In short, this is the effect of running adversarial refinement on CinePile:
- Fixed 90.24% of degenerate Q&A pairs in the test set.
- Manually reviewed the Q&A pairs that could not be fixed automatically (approximately 80 out of 800): fixed them where possible, and otherwise excluded them from the evaluation split.
- Fixed 90.94% of weak pairs in the training set.
- Retained the unfixable training pairs, since they could not be repaired automatically and do not meaningfully harm performance.
Implementation
This release includes both the adversarial refinement pipeline and the code for identifying weak questions. The complete implementation, including all the prompts, is available in the public repository.
Evaluation
After re-evaluating the previously tested models and 16 new Video-LLMs on our modified test set, we highlight the top-performing models in the image below. Here are the results:
- Gemini 1.5 Pro is the leading commercial vision-language model (VLM):
  - It especially excels in “Setting and Technical Analysis”, performing best on visually driven questions about film environments and character interactions.
- GPT-based models showed competitive performance:
  - They were strong in “Narrative and Plot Analysis”, scoring well on questions about story development and character interactions.
- Gemini 1.5 Flash, the lightweight version of Gemini 1.5 Pro:
  - Achieved an overall accuracy of 58.75%, with particularly good results in “Setting and Technical Analysis”.
Open-source models
The open source video LLM community has come a long way from the first release of CinePile to the current release. Here’s what we learned:
Hard split
CinePile’s hard-split results clearly demonstrate that current models still fall far short of human capabilities when it comes to understanding visual narratives and story elements. This gap highlights the value of the new CinePile release as a benchmark for measuring progress toward more sophisticated visual understanding.
Leaderboard
We’ve launched a new CinePile Leaderboard, which will be continually updated as new models come out. Watch this space to learn how to submit your own model for evaluation.