While developing Docmatix, we noticed that Florence-2 fine-tuned on it produced answers of excellent quality on DocVQA, yet scored low on the benchmark. To improve the score, we had to further fine-tune the model on DocVQA so it would learn the syntax the benchmark expects. Interestingly, human evaluators found this additionally fine-tuned model worse, which is why we used it mainly for ablation studies and released the model trained only on Docmatix for broader use.
Although the generated answers are semantically consistent with the reference answers, as shown in Figure 1, they still receive low scores. This raises the question: should we fine-tune our models just to improve these metrics, or should we develop new metrics that better align with human perception?

Figure 1: Zero-shot generation and t-SNE visualization of reference answers from the Docmatix dataset
Introduction
Our community has recently focused on out-of-distribution (OOD) evaluation, either through zero-shot transfer to unseen VQA tasks or by fine-tuning on one VQA dataset and evaluating on another. This shift is increasingly relevant with the rise of synthetic datasets such as Docmatix, SciGraphQA, and SimVQA, which are used to fine-tune vision-language models (VLMs).
Traditionally, VQA accuracy has been the main metric for evaluating model performance. It relies on an exact string match between the model's predicted answer and a set of human-annotated reference answers. This metric worked well because VQA evaluation followed an independent and identically distributed (IID) paradigm: the training and testing data distributions were similar, so models effectively learned to reproduce the expected answer format.
In OOD settings, however, the generated answers may not match the reference answers due to differences in format, specificity, or interpretation. This is perfectly illustrated in Figure 1, where we compare zero-shot generated answers against the reference answers of the synthetic dataset. The mismatch is especially pronounced between instruction-generated datasets and their human-curated counterparts. Some methods try to align the answer format with the references, but this only treats the symptom, not the root cause of a flawed evaluation metric. Human evaluation is reliable but costly and not scalable, which highlights the need for metrics that are both scalable and aligned with human judgment.
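To make this concrete, here is a minimal sketch of the classic VQA accuracy rule (an illustration, not code from any specific benchmark implementation): a candidate answer only gets full credit if it exactly matches what at least three annotators wrote, so a semantically equivalent but differently phrased answer scores zero.

def vqa_accuracy(candidate, references):
    # Exact string match against the human reference answers; full credit
    # requires at least 3 annotators to have given exactly this string.
    # (Real implementations also normalize punctuation and articles.)
    candidate = candidate.strip().lower()
    matches = sum(ref.strip().lower() == candidate for ref in references)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("sunny", ["sunny", "sunny", "sunny", "clear", "sunny"]))        # 1.0
print(vqa_accuracy("it is sunny", ["sunny", "sunny", "sunny", "clear", "sunny"]))  # 0.0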
Method
Docmatix is a synthetic DocVQA dataset generated from PDFA, a curated document dataset. It is 100 times larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as the evaluation benchmark for VQA models on document understanding. In this post, we use a subset of Docmatix consisting of approximately 200 test samples.


Figure 2: Examples of Q&A pairs from the Docmatix and DocVQA test sets. Note: the corresponding images are not shown here.
Although the question-and-answer pairs from Docmatix and DocVQA are similar, their styles differ significantly. In this zero-shot evaluation setting, traditional metrics such as CIDEr, ANLS, and BLEU can be overly restrictive. Motivated by the similarity of the embeddings observed in the t-SNE visualization (Figure 1), we decided to use a different evaluation metric. In this post, we consider the LAVE (LLM-Assisted VQA Evaluation) metric to better assess generalization on this unseen but semantically similar dataset.
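To illustrate how restrictive string-based metrics can be, here is a minimal sketch of the per-question ANLS score (Average Normalized Levenshtein Similarity, the standard DocVQA metric); this is an illustrative re-implementation, not the official evaluation code. A candidate whose normalized edit distance to every reference is at least 0.5 scores zero, even when it is semantically correct.

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def anls_single(candidate, references, tau=0.5):
    # Best similarity against any reference; similarities below the
    # threshold are zeroed out. ANLS averages this score over all questions.
    best = 0.0
    for ref in references:
        c, r = candidate.strip().lower(), ref.strip().lower()
        nl = levenshtein(c, r) / max(len(c), len(r), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best

print(anls_single("a cloudy day", ["cloudy"]))  # 0.0, despite being semantically close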
Figure 3: t-SNE visualization of questions, answers, and image features from the Docmatix and DocVQA datasets
For the evaluation, we chose MPLUGDocOwl1.5 as a baseline model. It achieves an ANLS score of 84% on the test subset of the original DocVQA dataset. We then ran zero-shot generation on a subset of Docmatix consisting of 200 images and used Llama-2-Chat-7B to rate the generated responses.
About LAVE
We followed the procedure outlined in the LAVE paper. The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions or incomplete answers. The prompt included a task description, several input/output demonstrations, and the input for a test example.
We structured the task description and included the instruction "Give the rationale before rating" to elicit a justification for the assigned rating. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for that rating. We also included "Provide only one rating" to avoid a sentence-by-sentence analysis.
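For reference, here is a minimal sketch of how the judge model could be loaded with the transformers library; the model id "meta-llama/Llama-2-7b-chat-hf" and the generation settings are assumptions for illustration, not our exact setup.

from transformers import pipeline

# Assumed model id for the Llama-2-Chat-7B judge (access to the gated repo is required).
judge = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device_map="auto")

def rate(prompt: str) -> str:
    # Greedy decoding; return only the newly generated rating text, not the echoed prompt.
    out = judge(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)
    return out[0]["generated_text"]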
task_description = """You are given a question, a set of gold-standard reference answers written by
experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question
considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
Give the rationale before rating. Provide only one rating.
THIS IS VERY IMPORTANT:
A binary question should only be answered with 'yes' or 'no';
otherwise the candidate answer is incorrect."""
demonstrations = [
    {
        "question": "How's the weather?",
        "reference_answer": ["Sunny", "Clear", "Bright", "Sunny", "Sunny"],
        "generated_answer": "cloudy",
    }
]
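To make the prompt construction concrete, here is a minimal sketch of how the task description, the demonstrations, and a test example can be stitched into a single prompt. The layout is an assumption for illustration rather than the exact formatting from the paper; the dictionary keys match the snippet above.

def build_lave_prompt(task_description, demonstrations, question, references, candidate):
    blocks = []
    for demo in demonstrations:
        # Real demonstrations also carry the rationale and the rating; abbreviated here.
        blocks.append(
            f"Question: {demo['question']}\n"
            f"Reference answers: {', '.join(demo['reference_answer'])}\n"
            f"Candidate answer: {demo['generated_answer']}\n"
            f"Rationale and rating: ..."
        )
    # The test example comes last; the LLM completes the rationale and the rating.
    blocks.append(
        f"Question: {question}\n"
        f"Reference answers: {', '.join(references)}\n"
        f"Candidate answer: {candidate}\n"
        f"Rationale and rating:"
    )
    return task_description + "\n\n" + "\n\n".join(blocks)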
Scoring function
Given the text generated by the LLM for a test example, we extract the rating from the last character (1, 2, or 3) and map it to a score in the range \([0, 1]\): \( s = \frac{r - 1}{2} \)
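A minimal sketch of this scoring step, assuming (as described above) that the rating is the final character of the judge's output:

def lave_score(llm_output: str) -> float:
    # The prompt asks for the rationale first and a single rating last,
    # so the last character of the generation is expected to be 1, 2, or 3.
    rating = int(llm_output.strip()[-1])
    assert rating in (1, 2, 3), f"unexpected rating: {rating}"
    return (rating - 1) / 2  # maps {1, 2, 3} -> {0.0, 0.5, 1.0}

print(lave_score("The candidate matches a reference answer. Rating: 3"))  # 1.0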
Results table
The results of the evaluation are summarized in the table below.
Metric    CIDEr     BLEU      ANLS     LAVE
Score     0.1411    0.0032    0.002    0.58
Qualitative examples

Figure 4: Llama's rating and rationale for a generated answer and the reference answers from the Docmatix test subset.

Figure 5: Llama's rating and rationale for a generated answer and the reference answers from the Docmatix test subset.
Are we too strict when evaluating VQA systems, and do we really need fine-tuning?
When we use an LLM to evaluate the responses, accuracy increases by roughly 50%, indicating that many answers are correct even though they do not follow the strict format. This suggests that our current evaluation metrics may be too rigid. It is important to note that this is not a comprehensive research article, and more ablation studies are needed to fully understand the effectiveness of different metrics for evaluating zero-shot performance on synthetic datasets. We hope this work serves as a starting point for improving the evaluation of zero-shot vision-language models in the context of synthetic datasets, and for broadening the focus of current research toward more efficient approaches beyond prompt learning.
References
@inproceedings{cascante2022simvqa, title={SimVQA: Exploring Simulated Environments for Visual Question Answering}, author={Cascante-Bonilla, Paola and Wu, Hui and Wang, Letao and Feris, Rogerio S and Ordonez, Vicente}, booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages={5056--5066}, year={2022}}
@article{hu2024mplug, title={mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding}, author={Hu, Anwen and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Zhang, Liang and Zhang, Bo and Li, Chen and Zhang, Ji and Jin, Qin and Huang, Fei and others}, journal={arXiv preprint}, year={2024}}
@article{agrawal2022reassessing, title={Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization}, author={Agrawal, Aishwarya and Kaji{\'c}, Ivana and Bugliarello, Emanuele and Davoodi, Elnaz and Gergely, Anita and Blunsom, Phil and Nematzadeh, Aida}, journal={arXiv preprint arXiv:2205.12191}, year={2022}}
@inproceedings{li2023blip, title={BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models}, author={Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven}, booktitle={International Conference on Machine Learning}, year={2023}, organization={PMLR}}
@inproceedings{manas2024improving, title={Improving Automatic VQA Evaluation Using Large Language Models}, author={Ma{\~n}as, Oscar and Krojer, Benno and Agrawal, Aishwarya}, booktitle={Proceedings of the AAAI Conference on Artificial Intelligence}, volume={38}, number={5}, pages={4171--4179}, year={2024}}
@article{li2023scigraphqa, title={SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs}, author={Li, Shengzhi and Tajbakhsh, Nima}, journal={arXiv preprint arXiv:2308.03349}, year={2023}}