QIMMA validates benchmarks before evaluating models, ensuring that reported scores reflect an LLM’s genuine Arabic proficiency.
If you’ve been following Arabic LLM leaderboards, you’ve probably noticed the space heating up. The number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we’re measuring?
We built QIMMA قمّة (meaning “peak” in Arabic) to answer that question systematically. Rather than simply aggregating existing Arabic benchmarks and running models on them, we applied a rigorous quality validation pipeline before performing any evaluation. What we discovered was sobering: even widely used, well-regarded Arabic benchmarks contain systemic quality issues that can quietly undermine evaluation results.
In this article, we explain what QIMMA is, how we built it, the issues we found, and how models rank after cleaning.

🔍 Problem: Arabic NLP evaluation is fragmented and untested
Although Arabic is spoken by more than 400 million people across diverse dialects and cultural backgrounds, the landscape of Arabic NLP evaluation remains fragmented. Several important issues motivate this work.
Translation issues. Many Arabic benchmarks are translated from English, which shifts the data distribution: questions that feel natural in English can be awkward or culturally out of place in Arabic, making the data less representative of natural Arabic usage.
Missing quality verification. Even native Arabic benchmarks are often released without rigorous quality checks. Annotation discrepancies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels are documented even in established resources.
Reproducibility gap. Evaluation scripts and per-sample outputs are rarely made public, making it difficult to audit results or build on prior work.
Coverage fragmentation. Existing leaderboards cover isolated tasks and narrow domains, making holistic model evaluation difficult.
To situate QIMMA relative to existing platforms:
| Leaderboard | Open Source | Native Arabic | Quality Verification | Code Evaluation | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | ✅ | Mixed | ❌ | ❌ | ✅ |
| OALL v2 | ✅ | Mostly | ❌ | ❌ | ✅ |
| BALSAM | Partial | 50% | ❌ | ❌ | ❌ |
| AraGen | ✅ | 100% | ❌ | ❌ | ❌ |
| SILMA ABL | ✅ | 100% | ✅ | ❌ | ✅ |
| ILMAAM | Partial | 100% | ✅ | ❌ | ❌ |
| HELM Arabic | ✅ | Mixed | ❌ | ❌ | ✅ |
| ⛰ QIMMA | ✅ | 99% | ✅ | ✅ | ✅ |
QIMMA is the only platform that combines all five characteristics: open source, primarily native Arabic content, systematic quality verification, code evaluation, and published per-sample inference outputs.
⛰ What’s inside QIMMA?
QIMMA combines 109 subsets from 14 source benchmarks into a unified evaluation suite of more than 52,000 samples across seven domains.
| Domain | Benchmarks | Task Type |
|---|---|---|
| Culture | AraDiCE-Culture, ArabCulture, PalmX | MCQ |
| STEM | ArabicMMLU, GAT, 3LM STEM | MCQ |
| Legal | ArabLegalQA, MizanQA | MCQ, QA |
| Medical | MedArabiQ, MedAraBench | MCQ, QA |
| Safety | AraTrust | MCQ |
| Poetry & Literature | FannOrFlop | QA |
| Coding | 3LM HumanEval+, 3LM MBPP+ | Code |
Several things stand out in this design.
- **99% native Arabic content.** The only exception is code evaluation, which is essentially language-independent.
- **The first Arabic leaderboard with code evaluation.** QIMMA integrates Arabic-adapted versions of HumanEval+ and MBPP+, assessing models’ coding ability from Arabic problem statements.
- **Domain and task diversity.** QIMMA assesses real-world competency areas such as education, governance, healthcare, creative expression, and software development.
🔬 Quality verification pipeline
This is the heart of QIMMA’s methodology. We applied a multi-stage validation pipeline to all samples of all benchmarks before running a single model.
Stage 1: Automatic multi-model evaluation
Each sample was independently evaluated by two state-of-the-art LLMs:

- Qwen3-235B-A22B-Instruct
- DeepSeek-V3-671B
We selected two models with strong Arabic capabilities but different training data compositions, making their combined judgment more robust than either alone.
Each model scores every sample against a 10-criterion rubric, assigning a binary score (0 or 1) per criterion.
A sample is flagged whenever either model scores it below 7/10. If both models flag a sample, it is removed immediately; if only one model flags it, the sample proceeds to Stage 2 human review.
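The Stage 1 decision logic can be sketched in a few lines. This is a minimal illustration; the function and variable names are ours, not QIMMA's:

```python
def filter_sample(scores_a: list[int], scores_b: list[int], threshold: int = 7) -> str:
    """Stage 1 decision for one sample, given two models' binary rubric scores.

    Each model scores the sample on a 10-criterion rubric (0/1 per criterion);
    a model "flags" the sample if its total falls below `threshold`.
    """
    flag_a = sum(scores_a) < threshold
    flag_b = sum(scores_b) < threshold
    if flag_a and flag_b:
        return "remove"        # both models agree: discard immediately
    if flag_a or flag_b:
        return "human_review"  # one model flags it: escalate to Stage 2
    return "keep"
```

A perfect-scoring sample is kept, a sample both models fail is removed, and a split decision goes to human annotators.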
Stage 2: Human annotation and review
Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make the final call on:
- Cultural context and regional differences
- Dialect nuances
- Subjective interpretation
- Subtle quality issues that automatic evaluation may miss
For culturally sensitive content, multiple viewpoints are considered, since “correctness” can differ substantially across Arab regions.
⚠️ What we found: Systemic quality issues
The pipeline revealed recurring quality issues across benchmarks. These are not isolated errors but systematic patterns reflecting gaps in how the benchmarks were originally constructed.
By the numbers
| Benchmark | Total Samples | Discarded | Discard Rate |
|---|---|---|---|
| ArabicMMLU | 14,163 | 436 | 3.1% |
| MizanQA | 1,769 | 41 | 2.3% |
| PalmX | 3,001 | 25 | 0.8% |
| MedAraBench | 4,960 | 33 | 0.7% |
| FannOrFlop | 6,984 | 43 | 0.6% |
| ArabCulture | 3,482 | 7 | 0.2% |
| MedArabiQ | 499 | 1 | 0.2% |
| GAT | 13,986 | 1 | ~0.0% |
| 3LM STEM | 2,609 | 1 | ~0.0% |
| AraDiCE-Culture | 180 | 0 | 0.0% |
| ArabLegalQA | 79 | 0 | 0.0% |
| AraTrust | 522 | 0 | 0.0% |
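As a quick sanity check, the discard rates follow directly from the counts (values taken from the table; only the three highest-rate benchmarks are shown):

```python
# Recompute discard rates from the raw counts reported above.
counts = {
    "ArabicMMLU": (14_163, 436),  # (total samples, discarded)
    "MizanQA": (1_769, 41),
    "PalmX": (3_001, 25),
}
for name, (total, discarded) in counts.items():
    print(f"{name}: {100 * discarded / total:.1f}%")  # 3.1%, 2.3%, 0.8%
```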
Classification of problems found
⚖️ Quality of answers
Incorrect or inconsistent gold indices, factually wrong answers, and missing or raw-text answers.
📄 Text and formatting quality
Corrupted or illegible text, spelling and grammatical errors, and duplicate samples.
💬 Cultural Sensitivity
Reinforcement of stereotypes and monolithic generalizations about diverse communities.
🤝 Gold Answer Compliance
Mismatches between gold answers and the evaluation protocol.
💻 Code Benchmarking: A different kind of quality work
Code benchmarks required a different intervention. Rather than discarding samples, we refined the Arabic problem statements in the Arabic adaptations of 3LM’s HumanEval+ and MBPP+, leaving task identifiers, reference solutions, and test suites completely unchanged.
The revision rate was striking.
| Benchmark | Total Prompts | Changed | Unchanged | Change Rate |
|---|---|---|---|---|
| 3LM HumanEval+ | 164 | 145 | 19 | 88% |
| 3LM MBPP+ | 378 | 308 | 70 | 81% |
Changes fall into five categories:
- **Linguistic improvements:** normalized toward natural Modern Standard Arabic and a consistent imperative style
- **Clarity fixes:** resolved ambiguous instructions and unclear constraints
- **Consistency normalization:** standardized formatting of mathematical terminology, punctuation, and examples
- **Structural fixes:** repaired broken triple-quoted strings, indentation errors, and stray text fragments
- **Semantic improvements:** clarified whether ranges were inclusive or exclusive while preserving task intent
⚙️ Evaluation setup
Evaluation framework
QIMMA uses LightEval, EvalPlus, and FannOrFlop as evaluation frameworks, chosen for their consistency, multilingual community adoption, and reproducibility.
Metrics by task type
| Task Type | Metric | Benchmarks |
|---|---|---|
| MCQ | Normalized log-likelihood accuracy | AraDiCE-Culture, ArabicMMLU, ArabCulture, PalmX, 3LM STEM, MedArabiQ, GAT, MedAraBench, AraTrust |
| MCQ | Probability mass of gold choices | MizanQA |
| Generative QA | F1, BERTScore (AraBERT v02) | MedArabiQ, ArabLegalQA, FannOrFlop |
| Code | Pass@1 | 3LM HumanEval+, 3LM MBPP+ |
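To make the two MCQ metrics concrete, here is a generic sketch of length-normalized log-likelihood accuracy and gold-choice probability mass. This is illustrative only, not QIMMA's exact LightEval configuration:

```python
import math

def loglikelihood_accuracy(choice_logliks, choice_lengths, gold_index):
    """Return 1 if the choice with the highest length-normalized
    log-likelihood is the gold choice, else 0. Dividing by the choice
    length removes the bias toward shorter answers."""
    normalized = [ll / max(n, 1) for ll, n in zip(choice_logliks, choice_lengths)]
    predicted = max(range(len(normalized)), key=normalized.__getitem__)
    return int(predicted == gold_index)

def gold_probability_mass(choice_logliks, gold_indices):
    """Softmax probability mass the model assigns to the gold choice(s),
    in the spirit of the MizanQA metric."""
    m = max(choice_logliks)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in choice_logliks]
    return sum(exps[i] for i in gold_indices) / sum(exps)
```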
Prompt templates
QIMMA standardizes question-based prompts using six template types:

- MCQ: general multiple choice
- MCQ-C: multiple choice with a contextual passage
- MCQ-I: multiple choice with specific instructions (GAT Similar/Complete)
- QA: general free-text QA
- QA-C: QA with context
- QA-F: fill-in-the-blank QA
All prompts are in Arabic. For MizanQA and ArabCulture, the benchmark-specific system prompts from the original papers are preserved.
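The six template types above could be assembled along these lines. The template wording here is a hypothetical English rendering for illustration; QIMMA's actual prompts are in Arabic and may be structured differently:

```python
# Illustrative prompt assembly for the six template types (hypothetical wording).
TEMPLATES = {
    "MCQ":   "Question: {question}\nChoices:\n{choices}\nAnswer:",
    "MCQ-C": "Passage: {context}\n\nQuestion: {question}\nChoices:\n{choices}\nAnswer:",
    "MCQ-I": "{instruction}\n\nQuestion: {question}\nChoices:\n{choices}\nAnswer:",
    "QA":    "Question: {question}\nAnswer:",
    "QA-C":  "Context: {context}\n\nQuestion: {question}\nAnswer:",
    "QA-F":  "Fill in the blank: {question}\nAnswer:",
}

def build_prompt(template_type: str, **fields) -> str:
    """Render the named template with the given fields."""
    return TEMPLATES[template_type].format(**fields)
```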
🏆 Leaderboard Results
Results as of April 2026, covering the top 10 models. Visit our live leaderboard for current rankings.
| Rank | Model | Average | AraDiCE-Culture | ArabicMMLU | ArabCulture | PalmX | 3LM STEM | AraTrust | MizanQA | MedArabiQ | ArabLegalQA | GAT | MedAraBench | HumanEval+ | MBPP+ | FannOrFlop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | Qwen/Qwen3.5-397B-A17B-FP8 | 68.06 | 82.78 | 77.54 | 61.75 | 83.91 | 88.67 | 90.04 | 73.36 | 47.30 | 54.94 | 55.89 | 47.97 | 67.68 | 76.72 | 44.33 |
| 🥈 2 | Applied Innovation Center/Karnak | 66.20 | 73.33 | 80.94 | 53.49 | 81.40 | 93.10 | 89.08 | 55.92 | 55.78 | 71.58 | 61.06 | 54.19 | 33.54 | 64.55 | 58.91 |
| 🥉 3 | inceptionai/Jais-2-70B-Chat | 65.81 | 78.89 | 81.29 | 83.24 | 83.73 | 87.96 | 90.23 | 71.78 | 52.79 | 69.60 | 51.67 | 50.89 | 19.51 | 43.65 | 56.13 |
| 4 | Qwen/Qwen2.5-72B-Instruct | 65.75 | 77.22 | 73.78 | 63.83 | 77.77 | 87.55 | 88.51 | 63.49 | 50.06 | 70.74 | 55.90 | 44.19 | 37.20 | 72.75 | 57.51 |
| 5 | Applied Innovation Center/AIC-1 | 65.37 | 73.33 | 72.02 | 77.52 | 76.11 | 88.13 | 90.61 | 56.36 | 53.75 | 68.96 | 62.11 | 50.78 | 28.05 | 69.58 | 47.83 |
| 6 | Qwen/Qwen3.5-122B-A10B | 64.84 | 74.44 | 73.17 | 37.78 | 81.46 | 86.18 | 86.97 | 64.01 | 47.04 | 55.11 | 50.90 | 52.49 | 65.24 | 72.43 | 60.54 |
| 7 | Sakarti/Ultima-72B | 64.49 | 78.33 | 72.28 | 68.79 | 76.75 | 83.70 | 89.08 | 60.44 | 44.58 | 69.12 | 46.91 | 42.25 | 39.02 | 74.07 | 57.56 |
| 8 | meta-llama/Llama-3.3-70B-Instruct | 63.96 | 77.22 | 71.57 | 78.05 | 77.95 | 88.28 | 85.63 | 67.44 | 56.25 | 64.00 | 51.13 | 54.86 | 27.44 | 71.16 | 24.43 |
| 9 | Qwen/Qwen2.5-32B-Instruct | 63.26 | 70.56 | 68.76 | 75.80 | 72.07 | 81.03 | 85.82 | 53.78 | 48.08 | 69.27 | 56.94 | 36.51 | 34.15 | 72.75 | 93.10 |
| 10 | FreedomIntelligence/AceGPT-v2-32B-Chat | 61.14 | 76.67 | 70.62 | 79.79 | 74.46 | 84.88 | 86.97 | 63.89 | 49.96 | 71.46 | 56.04 | 47.32 | 23.78 | 54.50 | 15.56 |
- **Scale does not guarantee top performance.** The top 10 spans models from 32B to 397B parameters, with several mid-sized models outperforming larger ones in certain domains.
- **Arabic-specific models lead cultural and linguistic tasks.** Jais-2-70B-Chat ranks highest on ArabicMMLU and ArabCulture, while Karnak ranks first on 3LM STEM and ArabLegalQA.
- **Coding remains the hardest area for Arabic-specific models.** The top HumanEval+ and MBPP+ scores belong to multilingual models, with Qwen3.5-397B leading both.
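The Average column appears to be the unweighted mean of the 14 benchmark scores; recomputing it for the top-ranked model reproduces the published value:

```python
# Per-benchmark scores for the top-ranked model, taken from the table above.
top_scores = [82.78, 77.54, 61.75, 83.91, 88.67, 90.04, 73.36,
              47.30, 54.94, 55.89, 47.97, 67.68, 76.72, 44.33]
average = sum(top_scores) / len(top_scores)
print(round(average, 2))  # 68.06
```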
Relationship between size and performance
Across the full leaderboard (46 models), a clear but imperfect correlation between size and performance emerges, with some interesting exceptions:

- Arabic-specific models often outperform size-matched multilingual models
- Instruction-tuned models consistently outperform base models, with the exception of Qwen3
- Some small Arabic-specific models (Fanar-1-9B, ALLaM-7B) outperform larger multilingual models in certain domains
🌟 QIMMA Difference
The characteristic properties of QIMMA can be summarized as follows.
| Property | Details |
|---|---|
| Quality-first philosophy | Validation before evaluation, not as an afterthought |
| Multi-model validation | Two LLMs with different training data, plus human review of flagged cases |
| 99% native Arabic | Almost entirely avoids translation artifacts |
| Multi-domain, multi-task | 7 domains, 3 task types (MCQ, QA, code), 109 subsets |
| Code evaluation | First Arabic leaderboard with code generation |
| Full transparency | Published per-sample inference outputs |
| Unified codebase | A publicly available, LightEval-based, reproducible evaluation codebase, not just aggregated scores |
| Dialect awareness | Explicit handling of MSA and dialect variation in prompts and rubrics |
🔗 Resources
🔖 Citation
@misc{alqadi2026arabicbenchmarksreliableqimmas,
  title={Are Arabic benchmarks reliable? QIMMA’s quality-first approach to LLM evaluation},
  author={Leen AlQadi and Ahmed Alzubaidi and Mohammed Alyafeai and Hamza Alobeidli and Maitha Alhammadi and Shaikha Alsuwaidi and Omar Alkaabi and Basma El Amel Boussaha and Hakim Hacid},
  year={2026},
  eprint={2604.03395},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.03395},
}

