A new benchmark for evaluating multimodal systems based on real video, audio and text data
From the Turing test to ImageNet, benchmarks have played an instrumental role in shaping artificial intelligence (AI) by helping define research goals and allowing researchers to measure progress towards them. Remarkable breakthroughs over the past decade, such as AlexNet in computer vision and AlphaFold in protein folding, have been closely tied to benchmark datasets, which let researchers rank model design and training choices and iterate to improve their models. As we work towards the goal of building artificial general intelligence (AGI), developing robust and effective benchmarks that stretch the capabilities of AI models is just as important as developing the models themselves.
Perception – the process of experiencing the world through the senses – is a significant part of intelligence. Building agents with a human-level perceptual understanding of the world is a central but challenging task, one that is becoming increasingly important in robotics, self-driving cars, personal assistants, medical imaging, and more. That is why today we are introducing the Perception Test, a multimodal benchmark using real-world videos to help evaluate the perception capabilities of a model.
Developing a perception benchmark
Many perception-related benchmarks are currently used across AI research, such as Kinetics for video action recognition, AudioSet for audio event classification, MOT for object tracking, and VQA for image question answering. These benchmarks have driven remarkable progress in how AI model architectures and training methods are built and developed, but each targets only restricted aspects of perception: image benchmarks exclude temporal aspects; visual question answering tends to focus on high-level semantic scene understanding; object tracking tasks generally capture the lower-level appearance of individual objects, such as colors and textures. And very few benchmarks define tasks over both audio and visual modalities.
Multimodal models such as Perceiver, Flamingo, and BEiT-3 aim to be more general models of perception. But their evaluations have been based on multiple specialized datasets, because no dedicated benchmark was available. This process is slow, expensive, and provides incomplete coverage of general perception abilities like memory, making it difficult for researchers to compare methods.
To address many of these issues, we created a dataset of purposefully designed videos of real-world activities, labeled according to six different types of tasks:
Object tracking: a box is provided around an object early in the video; the model must return a full track throughout the whole video (including through occlusions).
Point tracking: a point is selected early in the video; the model must track the point throughout the video (also through occlusions).
Temporal action localization: the model must temporally localize and classify a predefined set of actions.
Temporal sound localization: the model must temporally localize and classify a predefined set of sounds.
Multiple-choice video question answering: textual questions about the video, each with three possible answers from which to select one.
Grounded video question answering: textual questions about the video; the model must return one or more object tracks.
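To make these task interfaces concrete, below is a minimal sketch of how the per-task inputs and expected outputs could be represented in code. The dataclasses and field names are hypothetical illustrations only, not the benchmark's actual schema or API.

```python
# Hypothetical sketch of per-task inputs and expected outputs; names and
# fields are illustrative only, not the Perception Test's actual schema.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned bounding box for a single frame."""
    frame_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass
class ObjectTrackingTask:
    initial_box: Box          # given early in the video
    # Expected output: a Box per frame for the whole video (through occlusions).


@dataclass
class PointTrackingTask:
    initial_point: tuple      # (frame_id, x, y), given early in the video
    # Expected output: a point per frame for the whole video.


@dataclass
class TemporalLocalizationTask:
    class_names: List[str]    # predefined actions or sounds to localize
    # Expected output: (start_time, end_time, class_name) segments.


@dataclass
class MultipleChoiceQuestion:
    question: str
    options: List[str]        # three possible answers
    answer_id: Optional[int]  # hidden on the held-out test split


@dataclass
class GroundedQuestion:
    question: str
    # Expected output: one or more object tracks (lists of Boxes).
```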
We designed 37 video scripts, inspired by the way children's perception is assessed in developmental psychology, as well as by synthetic datasets such as CATER and CLEVRER. Each script was filmed by at least a dozen crowdsourced participants (similar to prior crowdsourced datasets such as Something-Something), with more than 100 participants in total, resulting in 11,609 videos averaging 23 seconds in length.
The videos show simple games or daily activities, which allow us to define tasks that require the following skills to solve:
Knowledge of semantics: task completion, and recognition of objects, actions, or sounds.
Understanding of physics: collisions, motion, occlusions, spatial relations.
Temporal reasoning or memory: detecting changes in a scene over time.
Abstraction abilities: shape matching, same/different notions, pattern detection.
Crowdsourced participants labeled the videos with spatial and temporal annotations (object bounding box tracks, point tracks, action segments, sound segments). Our research team designed the questions for the multiple-choice and grounded video question-answering tasks per script type, to ensure a diversity of skills tested, for example questions that probe the ability to reason counterfactually or to provide explanations for a given situation. The corresponding answers for each video were again provided by crowdsourced participants.
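As an illustration of what these annotations might look like once collected, here is a hypothetical, simplified record for a single video; the field names and labels are made up for illustration and may not match the released annotation files.

```python
# Hypothetical, simplified annotation record for one video; the field names
# and labels are illustrative and may not match the released annotation files.
example_annotation = {
    "video_id": "example_0001",
    "object_tracks": [
        {"label": "cup", "boxes": [  # one box per annotated frame
            {"frame_id": 0, "x_min": 0.31, "y_min": 0.42, "x_max": 0.45, "y_max": 0.60},
        ]},
    ],
    "point_tracks": [
        {"points": [{"frame_id": 0, "x": 0.52, "y": 0.48}]},
    ],
    "action_segments": [
        {"label": "putting something into something", "start": 2.1, "end": 4.3},
    ],
    "sound_segments": [
        {"label": "object collision", "start": 4.2, "end": 4.5},
    ],
    "mc_questions": [
        {"question": "Where is the ball at the end of the video?",
         "options": ["under the cup", "under the bowl", "on the table"],
         "answer_id": 0},
    ],
    "grounded_questions": [
        {"question": "Which object occludes the ball?"},  # answered with object tracks
    ],
}
```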
Evaluating multimodal systems with the Perception Test
We assume that models have been pre-trained on external datasets and tasks. The Perception Test includes a small fine-tuning set (20%) that model creators can optionally use to convey the nature of the tasks to their models. The remaining data (80%) consists of a public validation split and a held-out test split on which performance can only be evaluated via the evaluation server.
In this evaluation setup, the inputs are a video and audio sequence, plus a task specification. The task specification can be in high-level text form, as in visual question answering, or as low-level input, such as the coordinates of an object's bounding box in the object tracking task.
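A rough sketch of how a model creator might fit into this protocol is shown below. The file names, the trivial placeholder model, and the prediction format are assumptions made for illustration; the actual data layout and submission format are defined by the released benchmark and its evaluation server.

```python
# Rough sketch of the evaluation workflow; file names and formats here are
# assumptions for illustration, not the benchmark's actual interface.
import json


def run_model(video_path, audio_path, task_spec):
    """Placeholder for a pretrained multimodal model.

    A real model would consume the video frames, the audio waveform, and the
    task specification (text question or bounding-box coordinates) and return
    tracks, segments, or an answer id. Here we return a trivial prediction.
    """
    if task_spec.get("type") == "mc_question":
        return {"answer_id": 0}  # always pick the first option (dummy baseline)
    return {}


def predict_split(annotation_file, prediction_file):
    """Run the model over one split and write predictions to a file.

    Predictions on the held-out test split can only be scored by uploading
    such a file to the evaluation server.
    """
    with open(annotation_file) as f:
        examples = json.load(f)
    predictions = {
        video_id: run_model(ex["video"], ex["audio"], ex["task"])
        for video_id, ex in examples.items()
    }
    with open(prediction_file, "w") as f:
        json.dump(predictions, f)


# The small fine-tuning split (~20%) is optional; the rest is a public
# validation split plus a held-out test split scored by the server.
predict_split("valid_annotations.json", "valid_predictions.json")
```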
The evaluation results are detailed across several dimensions, and we measure abilities across the six computational tasks. For the visual question-answering tasks, we also provide a mapping of questions across the types of situations shown in the videos and the types of reasoning required to answer them, for a more detailed analysis (see our paper for more details). An ideal model would maximize the scores across all radar plots and all dimensions. This detailed assessment of a model's skills allows us to narrow down areas for improvement.
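For example, per-task scores of the kind described above could be visualized as a radar plot along the following lines; the numbers are made up, and this is not the benchmark's official analysis or plotting code.

```python
# Illustrative radar plot of per-task scores; the numbers are made up and
# this is not the benchmark's official analysis or plotting code.
import numpy as np
import matplotlib.pyplot as plt

scores = {
    "object tracking": 0.40,
    "point tracking": 0.30,
    "action localization": 0.50,
    "sound localization": 0.45,
    "multiple-choice QA": 0.55,
    "grounded QA": 0.35,
}

labels = list(scores)
values = list(scores.values())
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
# Close the polygon by repeating the first point.
values += values[:1]
angles += angles[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 1)
ax.set_title("Per-task scores (illustrative)")
plt.show()
```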
Ensuring the diversity of participants and scenes shown in the videos was a critical consideration when developing the benchmark. To do this, we selected participants from different countries, of different ethnicities and genders, and aimed to have diverse representation within each type of video script.
Learning more about the Perception Test
The Perception Test benchmark is publicly available here, and further details are available in our paper. A leaderboard and a challenge server will be available soon, too.
On October 23, 2022, we are hosting a workshop about general perception models at the European Conference on Computer Vision in Tel Aviv (ECCV 2022), where we will discuss our approach, and how to design and evaluate general perception models, with other leading experts in the field.
We hope the Perception Test will inspire and guide further research towards general perception models. Going forward, we hope to collaborate with the multimodal research community to introduce additional annotations, tasks, metrics, or even new languages to the benchmark.
If you are interested in contributing, please email PerceptionTest@google.com!