
LAVE: Zero-Shot VQA Evaluation of Docmatix Using LLMs

By versatileai | March 29, 2025

While developing Docmatix, we noticed that fine-tuning Florence-2 on it yielded excellent performance on DocVQA, yet the model scored poorly on the benchmark itself. To improve the score, we had to fine-tune the model further on DocVQA so it could learn the syntactic style the benchmark expects. Interestingly, human evaluators judged this additionally fine-tuned model to be worse, so we used it mainly for ablation studies and released the model trained only on Docmatix for broader use.

As Figure 1 shows, the generated answers are semantically consistent with the reference answers, yet they still receive low scores. This raises the question: should we fine-tune our models to improve these metrics, or should we develop new metrics that better align with human perception?

Figure 1: t-SNE visualization of zero-shot generated answers and reference answers from the Docmatix dataset

Introduction

Our community has recently shifted its focus toward out-of-distribution (OOD) evaluation, either through zero-shot transfer to unseen VQA tasks or by fine-tuning on one VQA dataset and evaluating on another. This shift is increasingly relevant with the rise of synthetic datasets such as Docmatix, SciGraphQA, and SimVQA, which are used to fine-tune vision-language models (VLMs).

Traditionally, VQA accuracy has been the main metric for evaluating model performance. It relies on an exact string match between a model's predicted answer and a set of human-annotated reference answers. This metric worked well because VQA evaluation followed an independent and identically distributed (IID) paradigm, in which the training and testing data distributions were similar, so models could adapt effectively.

In OOD settings, the generated answers may not match the reference answers because of differences in format, specificity, or interpretation. This is illustrated in Figure 1, where we compare zero-shot generated captions with reference captions from the synthetic dataset; it is especially pronounced between instruction-generated datasets and their human-curated counterparts. Some methods try to align the answer format with the references, but this only addresses the symptom, not the root cause of a flawed evaluation metric. Human evaluation is reliable, but it is costly and does not scale, which highlights the need for metrics that align more closely with human judgment.

Method

Docmatix is a synthetic DocVQA dataset generated from the curated document dataset PDFA, and it is 100 times larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as the evaluation benchmark for VQA models for document understanding. In this post, we use a subset of Docmatix consisting of approximately 200 test samples, which can be loaded as sketched below.
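
As a rough sketch, a subset like this can be pulled in with the datasets library. The repository ID, split name, and column names below are assumptions for illustration, not the exact ones used here.

from datasets import load_dataset

# NOTE: hypothetical dataset ID and split; substitute the actual Docmatix test subset.
subset = load_dataset("HuggingFaceM4/Docmatix", split="test")
subset = subset.select(range(200))  # roughly 200 evaluation samples

print(len(subset))       # ~200
print(subset[0].keys())  # expect fields such as the image, question, and reference answer(s)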

Figure 2: Examples of question and answer pairs from the Docmatix and DocVQA test sets. Note: the corresponding images are not shown here.

The question and answer pairs of Docmatix and DocVQA are quite similar, but their styles differ significantly. Traditional metrics such as CIDEr, ANLS, and BLEU can therefore be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in the t-SNE plot (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE (LLM-Assisted VQA Evaluation) metric to better assess generalization on this unseen but semantically similar dataset.

Figure 3: t-SNE visualization of question, answer, and image features from the Docmatix and DocVQA datasets

For the evaluation, we selected mPLUG-DocOwl 1.5 as the baseline model; it achieves an ANLS score of 84% on the test subset of the original DocVQA dataset. We then ran zero-shot generation on the 200-image subset of Docmatix and used Llama-2-Chat-7B to rate the answers.
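
One minimal way to stand up a Llama-2-Chat-7B judge is the transformers text-generation pipeline; this is a sketch under assumptions (checkpoint ID, device placement), not a record of the exact setup used here.

from transformers import pipeline

# Assumed checkpoint ID for the judge model (gated on the Hub; requires access).
judge = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
)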

About LAVE

We followed the procedure outlined in the paper. The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions and incomplete answers. The prompt included a task description, several input/output demonstrations, and the input for a test example.

We structured the task description and included the instruction "Give the rationale before rating" to elicit a justification for the assigned rating. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for the rating. We also included "Provide only one rating" to avoid a sentence-by-sentence analysis.

task_description = """You are given a question, a set of gold-standard reference answers written by
experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question,
considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
Give the rationale before rating. Provide only one rating.
THIS IS VERY IMPORTANT:
A binary question should only be answered with "yes" or "no",
otherwise the candidate answer is incorrect."""

demonstrations = [
    {
        "question": "How's the weather?",
        "reference_answer": ["sunny", "clear", "bright", "sunny", "sunny"],
        "generated_answer": "cloudy",
    }
]
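
To make the flow concrete, here is a minimal sketch of how the task description, demonstrations, and a test example might be stitched into one prompt and sent to the judge pipeline sketched above. The exact template (including the rating and rationale fields attached to each real demonstration) follows the paper; the formatting and the sample below are assumptions.

def build_prompt(task_description, demonstrations, test_example):
    # One block per demonstration: question, reference answers, candidate answer.
    # (Real demonstrations would also carry the rating and its rationale.)
    blocks = [task_description]
    for demo in demonstrations:
        blocks.append(
            f"Question: {demo['question']}\n"
            f"Reference answers: {', '.join(demo['reference_answer'])}\n"
            f"Candidate answer: {demo['generated_answer']}"
        )
    # The test example comes last; the LLM is expected to continue with a rationale
    # followed by a single rating.
    blocks.append(
        f"Question: {test_example['question']}\n"
        f"Reference answers: {', '.join(test_example['reference_answer'])}\n"
        f"Candidate answer: {test_example['generated_answer']}"
    )
    return "\n\n".join(blocks)

# Hypothetical Docmatix-style sample, for illustration only.
test_example = {
    "question": "What is the invoice date?",
    "reference_answer": ["12 March 2019"],
    "generated_answer": "The invoice is dated 12/03/2019.",
}

prompt = build_prompt(task_description, demonstrations, test_example)
output = judge(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]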

Scoring function

Given the LLM's generated text for a test example, we extract the rating from the last character (1, 2, or 3) and map it to a score in the range [0, 1]: \( s = \frac{r - 1}{2} \)
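
A small helper for this mapping might look as follows; it assumes the rating really is the last digit the judge emits (which the prompt requests but does not guarantee), and it treats unparseable outputs as incorrect by assumption.

import re

def lave_score(judge_output: str) -> float:
    """Extract the final 1-3 rating from the judge's text and map it to [0, 1] via s = (r - 1) / 2."""
    ratings = re.findall(r"[123]", judge_output)
    if not ratings:
        return 0.0  # assumption: unparseable output is scored as incorrect
    r = int(ratings[-1])
    return (r - 1) / 2

print(lave_score("The candidate contradicts every reference answer. Rating: 1"))  # 0.0
print(lave_score("The answer matches the references exactly. Rating: 3"))         # 1.0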

Results table

The results of the evaluation are summarized in the table below.

Metric    CIDEr     BLEU      ANLS     LAVE
Score     0.1411    0.0032    0.002    0.58

Qualitative example

Figure 4: Llama's rating and rationale for generated and reference answers from the Docmatix test subset.

Figure 5: Llama's rating and rationale for generated and reference answers from the Docmatix test subset.

Are we too strict in evaluating VQA systems, and do we need fine-tuning?

Evaluating the responses with an LLM yields an accuracy gain of approximately 50%, indicating that many answers are correct despite not following the strict format required by the benchmark. This suggests that our current evaluation metrics may be too rigid. Note that this is not a comprehensive research paper; more ablation studies are needed to fully understand the effectiveness of different metrics for evaluating zero-shot performance on synthetic datasets. We hope this work serves as a starting point for improving the evaluation of zero-shot vision-language models in the context of synthetic datasets, and for broadening the current research focus toward more efficient approaches beyond prompt learning.

References

@inproceedings{cascante2022simvqa,
  title     = {SimVQA: Exploring Simulated Environments for Visual Question Answering},
  author    = {Cascante-Bonilla, Paola and Wu, Hui and Wang, Letao and Feris, Rogerio S. and Ordonez, Vicente},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {5056--5066},
  year      = {2022}
}

@article{hu2024mplug,
  title   = {mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding},
  author  = {Hu, Anwen and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Zhang, Liang and Zhang, Bo and Li, Chen and Zhang, Ji and Jin, Qin and Huang, Fei},
  journal = {arXiv preprint},
  year    = {2024}
}

@article{agrawal2022reassessing,
  title   = {Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization},
  author  = {Agrawal, Aishwarya and Kaji{\'c}, Ivana and Bugliarello, Emanuele and Davoodi, Elnaz and Gergely, Anita and Blunsom, Phil and Nematzadeh, Aida},
  journal = {arXiv preprint arXiv:2205.12191},
  year    = {2022}
}

@inproceedings{li2023blip,
  title        = {BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models},
  author       = {Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
  booktitle    = {International Conference on Machine Learning},
  organization = {PMLR},
  year         = {2023}
}

@inproceedings{manas2024improving,
  title     = {Improving Automatic VQA Evaluation Using Large Language Models},
  author    = {Ma{\~n}as, Oscar and Krojer, Benno and Agrawal, Aishwarya},
  booktitle = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume    = {38},
  number    = {5},
  pages     = {4171--4179},
  year      = {2024}
}

@article{li2023scigraphqa,
  title   = {SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs},
  author  = {Li, Shengzhi and Tajbakhsh, Nima},
  journal = {arXiv preprint arXiv:2308.03349},
  year    = {2023}
}
