QIMMA validates benchmarks before evaluating models, ensuring that reported scores reflect an LLM’s genuine Arabic proficiency.
If you’ve been following Arabic LLM leaderboards, you’ve probably noticed the space heating up. The number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we’re measuring?
We built QIMMA قمّة (meaning “peak” in Arabic) to answer that question systematically. Rather than simply aggregating existing Arabic benchmarks and running models on them, we applied a rigorous quality validation pipeline before performing any evaluation. What we discovered was sobering: even widely used, well-regarded Arabic benchmarks contain systemic quality issues that can quietly undermine evaluation results.
In this article, we explain what QIMMA is, how we built it, the issues we found, and how models rank after cleaning.

🔍 Problem: Arabic NLP evaluation is fragmented and untested
Although Arabic is spoken by more than 400 million people across diverse dialects and cultural backgrounds, the landscape of Arabic NLP evaluation remains fragmented. Several important issues motivate this work.
Translation issues. Many Arabic benchmarks are translated from English, which shifts the data distribution: questions that feel natural in English can be awkward or culturally out of place in Arabic, making the data less representative of natural Arabic usage.
Missing quality verification. Even native Arabic benchmarks are often released without rigorous quality checks. Annotation discrepancies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels are documented even in established resources.
Reproducibility gap. Evaluation scripts and per-sample outputs are rarely made public, making it difficult to audit results or build on prior work.
Coverage fragmentation. Existing leaderboards cover isolated tasks and narrow domains, making holistic model evaluation difficult.
To situate QIMMA relative to existing platforms:
| Leaderboard | Open Source | Native Arabic | Quality Verification | Code Evaluation | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | ✅ | Mixed | ❌ | ❌ | ✅ |
| OALL v2 | ✅ | Mostly | ❌ | ❌ | ✅ |
| BALSAM | Partial | 50% | ❌ | ❌ | ❌ |
| AraGen | ✅ | 100% | ❌ | ❌ | ❌ |
| SILMA ABL | ✅ | 100% | ✅ | ❌ | ✅ |
| ILMAAM | Partial | 100% | ✅ | ❌ | ❌ |
| HELM Arabic | ✅ | Mixed | ❌ | ❌ | ✅ |
| ⛰ QIMMA | ✅ | 99% | ✅ | ✅ | ✅ |
QIMMA is the only platform that combines all five characteristics: open source, primarily native Arabic content, systematic quality verification, code evaluation, and published per-sample inference outputs.
⛰ What’s inside QIMMA?
QIMMA combines 109 subsets from 14 source benchmarks into a unified evaluation suite of more than 52,000 samples across seven domains.
| Domain | Benchmarks | Task Type |
|---|---|---|
| Culture | AraDiCE-Culture, ArabCulture, PalmX | MCQ |
| STEM | ArabicMMLU, GAT, 3LM STEM | MCQ |
| Legal | ArabLegalQA, MizanQA | MCQ, QA |
| Medical | MedArabiQ, MedAraBench | MCQ, QA |
| Safety | AraTrust | MCQ |
| Poetry & Literature | FannOrFlop | QA |
| Coding | 3LM HumanEval+, 3LM MBPP+ | Code |
Several things stand out in this design.
- **99% native Arabic content.** The only exception is code evaluation, which is essentially language-independent.
- **The first Arabic leaderboard with code evaluation.** QIMMA integrates Arabic-adapted versions of HumanEval+ and MBPP+, assessing models’ coding ability from Arabic problem statements.
- **Domain and task diversity.** QIMMA assesses real-world competency areas such as education, governance, healthcare, creative expression, and software development.
🔬 Quality verification pipeline
This is the heart of QIMMA’s methodology. We applied a multi-stage validation pipeline to all samples of all benchmarks before running a single model.
Stage 1: Automatic multi-model evaluation
Each sample was independently evaluated by two state-of-the-art LLMs:

- Qwen3-235B-A22B-Instruct
- DeepSeek-V3-671B
We selected two models with strong Arabic capabilities but different training data compositions, making their combined judgment more robust than either alone.
Each model scores every sample against a 10-criterion rubric, assigning a binary score (0 or 1) per criterion.
A sample is flagged whenever either model scores it below 7/10. If both models flag a sample, it is removed immediately; if only one model flags it, the sample proceeds to Stage 2 human review.
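The Stage 1 decision logic can be sketched in a few lines. This is a minimal illustration; the function and variable names are ours, not QIMMA's:

```python
def filter_sample(scores_a: list[int], scores_b: list[int], threshold: int = 7) -> str:
    """Stage 1 decision for one sample, given two models' binary rubric scores.

    Each model scores the sample on a 10-criterion rubric (0/1 per criterion);
    a model "flags" the sample if its total falls below `threshold`.
    """
    flag_a = sum(scores_a) < threshold
    flag_b = sum(scores_b) < threshold
    if flag_a and flag_b:
        return "remove"        # both models agree: discard immediately
    if flag_a or flag_b:
        return "human_review"  # one model flags it: escalate to Stage 2
    return "keep"
```

A perfect-scoring sample is kept, a sample both models fail is removed, and a split decision goes to human annotators.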
Stage 2: Human annotation and review
Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make the final call on:
- Cultural context and regional differences
- Dialect nuances
- Subjective interpretation
- Subtle quality issues that automatic evaluation may miss
For culturally sensitive content, multiple viewpoints are considered, since “correctness” can differ substantially across Arab regions.
⚠️ What we found: Systemic quality issues
The pipeline revealed recurring quality issues across benchmarks. These are not isolated errors but systematic patterns reflecting gaps in how the benchmarks were originally constructed.
By the numbers
| Benchmark | Total Samples | Discarded | Discard Rate |
|---|---|---|---|
| ArabicMMLU | 14,163 | 436 | 3.1% |
| MizanQA | 1,769 | 41 | 2.3% |
| PalmX | 3,001 | 25 | 0.8% |
| MedAraBench | 4,960 | 33 | 0.7% |
| FannOrFlop | 6,984 | 43 | 0.6% |
| ArabCulture | 3,482 | 7 | 0.2% |
| MedArabiQ | 499 | 1 | 0.2% |
| GAT | 13,986 | 1 | ~0.0% |
| 3LM STEM | 2,609 | 1 | ~0.0% |
| AraDiCE-Culture | 180 | 0 | 0.0% |
| ArabLegalQA | 79 | 0 | 0.0% |
| AraTrust | 522 | 0 | 0.0% |
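As a quick sanity check, the discard rates follow directly from the counts (values taken from the table; only the three highest-rate benchmarks are shown):

```python
# Recompute discard rates from the raw counts reported above.
counts = {
    "ArabicMMLU": (14_163, 436),  # (total samples, discarded)
    "MizanQA": (1_769, 41),
    "PalmX": (3_001, 25),
}
for name, (total, discarded) in counts.items():
    print(f"{name}: {100 * discarded / total:.1f}%")  # 3.1%, 2.3%, 0.8%
```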
Classification of problems found
⚖️ Quality of answers
Incorrect or inconsistent gold indices, factually wrong answers, and missing or raw-text answers.
📄 Text and formatting quality
Corrupted or illegible text, spelling and grammatical errors, and duplicate samples.
💬 Cultural Sensitivity
Reinforcement of stereotypes and monolithic generalizations about diverse communities.
🤝 Gold Answer Compliance
Mismatches between gold answers and the evaluation protocol.
💻 Code Benchmarking: A different kind of quality work
Code benchmarks required a different intervention. Rather than discarding samples, we refined the Arabic problem statements in the Arabic adaptations of 3LM’s HumanEval+ and MBPP+, leaving task identifiers, reference solutions, and test suites completely unchanged.
The revision rate was striking.
| Benchmark | Total Prompts | Changed | Unchanged | Change Rate |
|---|---|---|---|---|
| 3LM HumanEval+ | 164 | 145 | 19 | 88% |
| 3LM MBPP+ | 378 | 308 | 70 | 81% |
Changes fall into five categories:
- **Linguistic improvements:** normalized toward natural Modern Standard Arabic and a consistent imperative style
- **Clarity fixes:** resolved ambiguous instructions and unclear constraints
- **Consistency normalization:** standardized formatting of mathematical terminology, punctuation, and examples
- **Structural fixes:** repaired broken triple-quoted strings, indentation errors, and stray text fragments
- **Semantic improvements:** clarified whether ranges were inclusive or exclusive while preserving task intent
⚙️ Evaluation setup
Evaluation framework
QIMMA uses LightEval, EvalPlus, and FannOrFlop as evaluation frameworks, chosen for their consistency, multilingual community adoption, and reproducibility.
Metrics by task type
| Task Type | Metric | Benchmarks |
|---|---|---|
| MCQ | Normalized log-likelihood accuracy | AraDiCE-Culture, ArabicMMLU, ArabCulture, PalmX, 3LM STEM, MedArabiQ, GAT, MedAraBench, AraTrust |
| MCQ | Probability mass of gold choices | MizanQA |
| Generative QA | F1, BERTScore (AraBERT v02) | MedArabiQ, ArabLegalQA, FannOrFlop |
| Code | Pass@1 | 3LM HumanEval+, 3LM MBPP+ |
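To make the two MCQ metrics concrete, here is a generic sketch of length-normalized log-likelihood accuracy and gold-choice probability mass. This is illustrative only, not QIMMA's exact LightEval configuration:

```python
import math

def loglikelihood_accuracy(choice_logliks, choice_lengths, gold_index):
    """Return 1 if the choice with the highest length-normalized
    log-likelihood is the gold choice, else 0. Dividing by the choice
    length removes the bias toward shorter answers."""
    normalized = [ll / max(n, 1) for ll, n in zip(choice_logliks, choice_lengths)]
    predicted = max(range(len(normalized)), key=normalized.__getitem__)
    return int(predicted == gold_index)

def gold_probability_mass(choice_logliks, gold_indices):
    """Softmax probability mass the model assigns to the gold choice(s),
    in the spirit of the MizanQA metric."""
    m = max(choice_logliks)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in choice_logliks]
    return sum(exps[i] for i in gold_indices) / sum(exps)
```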
Prompt templates
QIMMA standardizes question-based prompts using six template types:

- MCQ: general multiple choice
- MCQ-C: multiple choice with a contextual passage
- MCQ-I: multiple choice with specific instructions (GAT Similar/Complete)
- QA: general free-text QA
- QA-C: QA with context
- QA-F: fill-in-the-blank QA
All prompts are in Arabic. For MizanQA and ArabCulture, the benchmark-specific system prompts from the original papers are preserved.
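The six template types above could be assembled along these lines. The template wording here is a hypothetical English rendering for illustration; QIMMA's actual prompts are in Arabic and may be structured differently:

```python
# Illustrative prompt assembly for the six template types (hypothetical wording).
TEMPLATES = {
    "MCQ":   "Question: {question}\nChoices:\n{choices}\nAnswer:",
    "MCQ-C": "Passage: {context}\n\nQuestion: {question}\nChoices:\n{choices}\nAnswer:",
    "MCQ-I": "{instruction}\n\nQuestion: {question}\nChoices:\n{choices}\nAnswer:",
    "QA":    "Question: {question}\nAnswer:",
    "QA-C":  "Context: {context}\n\nQuestion: {question}\nAnswer:",
    "QA-F":  "Fill in the blank: {question}\nAnswer:",
}

def build_prompt(template_type: str, **fields) -> str:
    """Render the named template with the given fields."""
    return TEMPLATES[template_type].format(**fields)
```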
🏆 Leaderboard Results
Results as of April 2026, covering the top 10 models. Visit our live leaderboard for current rankings.
| Rank | Model | Average | AraDiCE-Culture | ArabicMMLU | ArabCulture | PalmX | 3LM STEM | AraTrust | MizanQA | MedArabiQ | ArabLegalQA | GAT | MedAraBench | HumanEval+ | MBPP+ | FannOrFlop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | Qwen/Qwen3.5-397B-A17B-FP8 | 68.06 | 82.78 | 77.54 | 61.75 | 83.91 | 88.67 | 90.04 | 73.36 | 47.30 | 54.94 | 55.89 | 47.97 | 67.68 | 76.72 | 44.33 |
| 🥈 2 | Applied Innovation Center/Karnak | 66.20 | 73.33 | 80.94 | 53.49 | 81.40 | 93.10 | 89.08 | 55.92 | 55.78 | 71.58 | 61.06 | 54.19 | 33.54 | 64.55 | 58.91 |
| 🥉 3 | inceptionai/Jais-2-70B-Chat | 65.81 | 78.89 | 81.29 | 83.24 | 83.73 | 87.96 | 90.23 | 71.78 | 52.79 | 69.60 | 51.67 | 50.89 | 19.51 | 43.65 | 56.13 |
| 4 | Qwen/Qwen2.5-72B-Instruct | 65.75 | 77.22 | 73.78 | 63.83 | 77.77 | 87.55 | 88.51 | 63.49 | 50.06 | 70.74 | 55.90 | 44.19 | 37.20 | 72.75 | 57.51 |
| 5 | Applied Innovation Center/AIC-1 | 65.37 | 73.33 | 72.02 | 77.52 | 76.11 | 88.13 | 90.61 | 56.36 | 53.75 | 68.96 | 62.11 | 50.78 | 28.05 | 69.58 | 47.83 |
| 6 | Qwen/Qwen3.5-122B-A10B | 64.84 | 74.44 | 73.17 | 37.78 | 81.46 | 86.18 | 86.97 | 64.01 | 47.04 | 55.11 | 50.90 | 52.49 | 65.24 | 72.43 | 60.54 |
| 7 | Sakarti/Ultima-72B | 64.49 | 78.33 | 72.28 | 68.79 | 76.75 | 83.70 | 89.08 | 60.44 | 44.58 | 69.12 | 46.91 | 42.25 | 39.02 | 74.07 | 57.56 |
| 8 | meta-llama/Llama-3.3-70B-Instruct | 63.96 | 77.22 | 71.57 | 78.05 | 77.95 | 88.28 | 85.63 | 67.44 | 56.25 | 64.00 | 51.13 | 54.86 | 27.44 | 71.16 | 24.43 |
| 9 | Qwen/Qwen2.5-32B-Instruct | 63.26 | 70.56 | 68.76 | 75.80 | 72.07 | 81.03 | 85.82 | 53.78 | 48.08 | 69.27 | 56.94 | 36.51 | 34.15 | 72.75 | 93.10 |
| 10 | FreedomIntelligence/AceGPT-v2-32B-Chat | 61.14 | 76.67 | 70.62 | 79.79 | 74.46 | 84.88 | 86.97 | 63.89 | 49.96 | 71.46 | 56.04 | 47.32 | 23.78 | 54.50 | 15.56 |
- **Scale does not guarantee top performance.** The top 10 spans models from 32B to 397B parameters, with several mid-sized models outperforming larger ones in certain domains.
- **Arabic-specific models lead cultural and linguistic tasks.** Jais-2-70B-Chat ranks highest on ArabicMMLU and ArabCulture, while Karnak ranks first on 3LM STEM and ArabLegalQA.
- **Coding remains the hardest area for Arabic-specific models.** The top HumanEval+ and MBPP+ scores belong to multilingual models, with Qwen3.5-397B leading both.
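The Average column appears to be the unweighted mean of the 14 benchmark scores; recomputing it for the top-ranked model reproduces the published value:

```python
# Per-benchmark scores for the top-ranked model, taken from the table above.
top_scores = [82.78, 77.54, 61.75, 83.91, 88.67, 90.04, 73.36,
              47.30, 54.94, 55.89, 47.97, 67.68, 76.72, 44.33]
average = sum(top_scores) / len(top_scores)
print(round(average, 2))  # 68.06
```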
Relationship between size and performance
Across the full leaderboard (46 models), a clear but imperfect correlation between size and performance emerges, with some interesting exceptions:

- Arabic-specific models often outperform size-matched multilingual models
- Instruction-tuned models consistently outperform base models, with the exception of Qwen3
- Some small Arabic-specific models (Fanar-1-9B, ALLaM-7B) outperform larger multilingual models in certain domains
🌟 QIMMA Difference
The characteristic properties of QIMMA can be summarized as follows.
| Property | Details |
|---|---|
| Quality-first philosophy | Validation before evaluation, not as an afterthought |
| Multi-model validation | Two LLMs with different training data, plus human review of flagged cases |
| 99% native Arabic | Almost entirely avoids translation artifacts |
| Multi-domain, multi-task | 7 domains, 3 task types (MCQ, QA, code), 109 subsets |
| Code evaluation | First Arabic leaderboard with code generation |
| Full transparency | Published per-sample inference outputs |
| Unified codebase | A publicly available, LightEval-based, reproducible evaluation codebase, not just aggregated scores |
| Dialect awareness | Explicit handling of MSA and dialect variation in prompts and rubrics |
🔗 Resources
🔖 Citation
@misc{alqadi2026arabicbenchmarksreliableqimmas,
  title={Are Arabic benchmarks reliable? QIMMA’s quality-first approach to LLM evaluation},
  author={Leen AlQadi and Ahmed Alzubaidi and Mohammed Alyafeai and Hamza Alobeidli and Maitha Alhammadi and Shaikha Alsuwaidi and Omar Alkaabi and Basma El Amel Boussaha and Hakim Hacid},
  year={2026},
  eprint={2604.03395},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.03395},
}

