Models are very good at understanding text on its own, but what about text in images? Text in an image often carries important contextual information: think of navigating a map or understanding a meme. The ability to reason about the interplay between the text in an image and its visual context can power many real-world applications, such as AI assistants and tools to help the visually impaired.
We call these tasks "context-sensitive text-rich visual reasoning tasks."
At the moment, most evaluations of instruction-tuned large multimodal models (LMMs) focus on testing how well models can respond to human instructions posed as questions or imperative tasks (e.g., "count this", "list that") over images, rather than on how well they jointly reason over the text and the visual context in an image.
That's why we (researchers at the University of California, Los Angeles) created ConTextual, a context-sensitive text-rich visual reasoning dataset for evaluating LMMs. We also released a leaderboard so that the community can see for themselves which models perform best on this task.
For a deeper dive, you can also check out these additional resources: paper, code, dataset, validation dataset, and leaderboard.
What is ConTextual?
ConTextual is a context-sensitive text-rich visual reasoning dataset consisting of 506 challenging instructions for LMM evaluation. We created a diverse set of instructions on text-rich images, with the constraint that they should require context-sensitive joint reasoning over the textual and visual cues in the image.
It covers eight real-world visual scenarios: Time Reading, Shopping, Navigation, Abstract Scenes, Mobile Applications, Webpages, Infographics, and Miscellaneous Natural Scenes. (See the figure for samples from the dataset.)
Each sample is as follows:
- A text-rich image
- A human-written instruction (question or imperative task)
- A human-written reference response
The dataset is released in two formats.
(a) A validation set of 100 instances from the complete dataset with instructions, images, and reference answers. (b) A test set with instructions and images only.
The leaderboard contains model results on both the validation and test sets (the numbers are also reported in the paper). The validation set allows practitioners to test and iterate on their approaches easily. The evaluation sandbox is available on GitHub.
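To get started with the data itself, here is a minimal sketch of loading the two splits with the datasets library. The dataset ids, split name, and column names below are assumptions for illustration only; check the dataset cards on the Hub for the exact values.

```python
# Minimal sketch: load the ConTextual splits from the Hugging Face Hub.
# NOTE: the dataset ids, split name, and column names are assumptions --
# consult the dataset cards for the exact values.
from datasets import load_dataset

# Validation split: instances with instruction, image, and reference response.
val_ds = load_dataset("ucla-contextual/contextual_val", split="train")

# Test split: instances with only instruction and image.
test_ds = load_dataset("ucla-contextual/contextual_test", split="train")

print(val_ds)   # inspect the columns available in the validation split
print(test_ds)  # inspect the columns available in the test split
```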
Experiments
For our initial experiments, our benchmark assessed the performance of 13 models. We divided them into three categories:
- Augmented LLM approach: GPT-4 supplied with visual information in the form of OCR text and/or dense image captions;
- Closed-source LMMs: GPT-4V(ision) and Gemini-Vision-Pro;
- Open-source LMMs: LLaVA-v1.5-13B, ShareGPT4V-7B, InstructBLIP-Vicuna-7B, mPLUG-Owl-v2-7B, BLIVA-Vicuna-7B, Qwen-VL-7B, and Idefics-9B.
Our dataset contains a reference response for each instruction, allowing us to test various automatic evaluation methods. For evaluation, we use the LLM-as-a-judge approach: we prompt GPT-4 with the instruction, the reference response, and the predicted response, and the model has to return whether the predicted response is acceptable or not. (We chose GPT-4 because it correlated the most with human judgment in our experiments.)
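Below is a minimal sketch of this judging step, assuming access to the OpenAI API; the judge prompt wording is illustrative and not the exact prompt used in the paper.

```python
# Minimal sketch of the GPT-4 LLM-as-a-judge step described above.
# Assumes OPENAI_API_KEY is set; the prompt wording is illustrative only.
from openai import OpenAI

client = OpenAI()

def judge_response(instruction: str, reference: str, prediction: str) -> int:
    """Return 1 if GPT-4 deems the predicted response acceptable, else 0."""
    prompt = (
        "You are judging a model's response to an instruction about an image.\n"
        f"Instruction: {instruction}\n"
        f"Reference response: {reference}\n"
        f"Predicted response: {prediction}\n"
        "Is the predicted response acceptable given the reference? Answer 'yes' or 'no'."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return 1 if verdict.startswith("yes") else 0
```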
Let’s take a look at some examples!
Example 1
In this instance, GPT-4V provides an incorrect response to the instruction despite its logical reasoning. Green indicates the parts of the response that match the reference, while red highlights errors in the response. Additionally, a summarized reasoning is provided to outline the rationale GPT-4V used to arrive at its answer.
Example 2
In this instance, GPT-4V responds to the instruction correctly. However, ShareGPT4V-7B (the best-performing open-source LMM) and GPT-4 w/ Layout-aware OCR + Caption (the augmented LLM) produce incorrect responses, owing to their lack of joint reasoning over the text and the image.
Examples like this can be found in the Appendix section of our paper!
Key takeaways!
While working on this, we found that:
- Modern LMMs (proprietary and open models) struggle to perform well on the ConTextual dataset, while humans are good at it, hinting at the potential for model improvements to enhance reasoning over text-rich images, a domain with significant real-world applications.
- Proprietary LMMs perform poorly on infographics reasoning that involves time reading, indicating a gap in their capabilities compared to humans. Notably, GPT-4V, the best-performing model, surpasses humans in abstract reasoning, potentially due to exposure to memes and quotes data, but struggles with time-related tasks where humans excel.
- For open-source models such as LLaVA-1.5-13B and ShareGPT4V-7B, there is a strong gap between the domains on which they achieve acceptable human ratings (abstract and natural scene contexts) and the other domains (time reading, infographics, navigation, shopping, web, and mobile usage).
- Augmenting large language models with visual information converted to text via OCR or captions is not sufficient either, as seen in the augmented LLM results.
From our analysis, the next promising steps are:
- developing enhanced image encoders,
- creating highly accurate image descriptions,
- facilitating fine-grained vision-language alignment to improve a model's perception and mitigate hallucinations.
This, in turn, will lead to more effective context-sensitive text-rich visual reasoning.
What’s next?
We would love to evaluate your models and collectively advance the state of vision-language models! To submit, please follow the guidelines below.
We hope this benchmark will help in developing nuanced vision-language alignment techniques, and we welcome collaborations of any kind! You can reach out to Rohan and Hritik to learn more, and find the team here: Rohan, Hritik, Kai-Wei Chang, Nanyun (Violet) Peng.
How do I submit?
We accept submissions for both test and validation sets. Follow the corresponding steps below.
Submitting a validation set
To submit your validation results to the leaderboard, you can run our automatic evaluation code (a GPT-4-based evaluation pipeline) by following these instructions.
We expect your submission to be in the following JSON format:
{"model_name": {"img_url": "The boolean score of your model on the image, 1 for success and 0 for failure"}}
Replace model_name with your model's name (string). Replace img_url with the img_url of the instance (string). The value for each img_url is 0 or 1 (int).
There should be 100 predictions, corresponding to the 100 URLs of the val set.
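As an illustration, the snippet below assembles and sanity-checks a validation submission file. The model name and image URLs are hypothetical placeholders, and the 0/1 scores could come from the judging step sketched earlier.

```python
# Minimal sketch: build the validation-set submission JSON.
# The model name and image URLs below are hypothetical placeholders.
import json

model_name = "my-lmm-v1"
scores = {
    "https://example.com/val_0.png": 1,  # 1 = the judge accepted the response
    "https://example.com/val_1.png": 0,  # 0 = the judge rejected the response
    # ... one entry per validation image URL (100 in total)
}

# Light sanity checks before writing the file.
if len(scores) != 100:
    print(f"Warning: expected 100 entries, found {len(scores)}")
if not all(v in (0, 1) for v in scores.values()):
    print("Warning: every value should be 0 or 1 (int)")

with open("val_submission.json", "w") as f:
    json.dump({model_name: scores}, f, indent=2)
```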
To submit, please visit the leaderboard hosted on Hugging Face and fill out the submission form.
Submitting a test set
Once you are happy with your validation results, you can send your model's predictions on the test set to Rohan and Hritik.
Please include the following in your email:
- Model name
- Model affiliation (organization)
- (Optional) GitHub repo or paper link
We expect the submission to be in the same JSON format as for the val set, as shown below:
{"model_name": {"img_url": "predicted response"}}
Replace model_name with your model's name (string). Replace img_url with the img_url of the instance (string). The value for each img_url is the predicted response for that instance (string).
There should be 506 predictions, corresponding to the 506 URLs of the test set.
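For illustration, here is a sketch of collecting test-set predictions into that format. The dataset id, column names, and generate_response function are assumptions standing in for your own data loading and model inference code.

```python
# Minimal sketch: collect test-set predictions into the submission JSON.
# The dataset id and column names are assumptions; generate_response is a
# placeholder for your model's inference call.
import json
from datasets import load_dataset

test_ds = load_dataset("ucla-contextual/contextual_test", split="train")

def generate_response(image, instruction: str) -> str:
    """Placeholder: replace with a call to your LMM."""
    return "model answer"

model_name = "my-lmm-v1"  # hypothetical model name
predictions = {
    sample["img_url"]: generate_response(sample["image"], sample["instruction"])
    for sample in test_ds
}

with open("test_submission.json", "w") as f:
    json.dump({model_name: predictions}, f, indent=2)
```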