Versa AI hub
Tools

Towards a robust assessment of Emirati dialect proficiency in Arabic LLMs

By versatileai · February 28, 2026 · 8 Mins Read

Arabic is one of the most widely spoken languages in the world, with hundreds of millions of speakers across more than 20 countries. Despite this global reach, Arabic is not a monolithic language. Modern Standard Arabic (MSA) coexists with a rich landscape of regional dialects that differ widely in vocabulary, syntax, phonology, and cultural underpinnings. These dialects are the primary medium of everyday communication, oral storytelling, poetry, and social interaction. Yet most existing benchmarks for Arabic large language models focus almost exclusively on Modern Standard Arabic, leaving dialectal Arabic largely underrepresented.

This gap is especially problematic as large language models increasingly interact with users in informal, culturally grounded, conversational settings. A model that performs well on formal newswire text may fail to understand greetings, idiomatic expressions, or short anecdotes expressed in a local dialect. To address this limitation, our team introduced Alyah الياه (meaning "North Star" ⭐️ in the Emirati dialect), a UAE-centric benchmark designed to assess how well Arabic LLMs capture the linguistic, cultural, and pragmatic aspects of the Emirati dialect.

Benchmark motivation and scope

The Emirati dialect is deeply connected to local culture, heritage, and history. It appears in everyday greetings, oral poetry, proverbs, folk tales, and expressions whose meaning cannot be guessed from a literal translation. Our benchmark is intentionally designed to probe this depth. Rather than testing surface-level vocabulary knowledge, we challenge models on their ability to interpret culturally embedded meanings, pragmatic usage, and dialect-specific nuances.

The benchmark covers a wide range of content, including common and unusual local expressions, culturally grounded greetings, short anecdotes, heritage-related questions, and references to Emirati poetry. The goal is not only to measure accuracy, but also to understand where models systematically succeed or fail when faced with authentic Emirati language use.

Dataset structure

After several rounds of development and integration, the material was consolidated into a single dataset called Alyah. The final benchmark comprises 1,173 samples, all manually collected from native speakers in the United Arab Emirates to ensure linguistic authenticity and cultural grounding. This manual curation step was essential to capture expressions, meanings, and usages that are poorly documented in written resources and difficult to infer from Modern Standard Arabic alone.

Each sample is a multiple-choice question with four possible answers, exactly one of which is correct. Distractor choices were generated synthetically with a large language model and then reviewed to ensure plausibility and semantic closeness to the correct answer. To avoid positional bias during evaluation, the ground-truth index follows a randomized distribution across the dataset. The figure below shows the word counts and the distribution of answer positions for each question.
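As an illustration of this randomization step (the schema and function below are hypothetical, not the benchmark's actual code), shuffling the choices of each item so that the ground-truth index is uniformly distributed might look like:

```python
import random

def randomize_answer_position(question, choices, correct_index, rng=random):
    """Shuffle one item's answer choices so the correct answer's position
    is uniformly random, and return the new ground-truth index alongside
    the shuffled choices."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return {
        "question": question,
        "choices": shuffled,
        "answer": order.index(correct_index),  # new ground-truth index
    }
```

Applying this to every sample keeps any residual ordering in the raw data from leaking into the answer-position distribution.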

[Figure: word counts and distribution of answer positions across the dataset]

Alyah spans a wide range of linguistic and cultural phenomena in the Emirati dialect, from everyday expressions to culturally sensitive and figurative language. The distribution across categories is summarized below.

Category                             Samples   Difficulty
Greetings and Everyday Expressions   61        Easy
Religion and Social Sensitivity      78        Medium
Imagery and Figurative Meaning       121       Medium
Etiquette and Values                 173       Medium
Poetry and Creative Expression       32        Difficult
History and Heritage Knowledge       89        Difficult
Languages and Dialects               619       Difficult

Below are examples of each category.

[Figure: example questions for each category]

This configuration allows Alyah to jointly assess surface-level conversational fluency and deeper cultural, semantic, and pragmatic understanding, with special emphasis on dialect-specific linguistic phenomena that remain a challenge for current models.

Setting up model evaluation

We evaluated a total of 53 language models, consisting of 22 base models and 31 instruction-tuned models, spanning several architectures and training paradigms. These include native Arabic LLMs such as Jais and ALLaM, multilingual models with strong Arabic support such as Qwen and LLaMA, and regionally adapted models such as Fanar and AceGPT. For each family, both base and instruction-tuned variants were evaluated to understand the effects of alignment and instruction tuning on dialect performance.

All models were evaluated under consistent prompting and scoring protocols. Answers were assessed for semantic accuracy and appropriateness in Emirati usage, rather than for literal overlap with reference answers. This matters particularly in dialect evaluation, where multiple valid expressions may exist.
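The post does not publish its exact prompt template, but the kind of fixed template a consistent protocol implies could be sketched as follows (the wording below is an assumption, not the benchmark's actual prompt):

```python
LETTERS = ["A", "B", "C", "D"]

def format_mcq_prompt(question, choices):
    """Render one multiple-choice item with a single fixed template so
    every model is queried with identical formatting."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer with a single letter (A, B, C, or D).")
    return "\n".join(lines)
```

Accuracy is then simply the fraction of items on which the letter a model returns matches the (randomized) ground-truth position.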

For each question category, we empirically estimated a difficulty level from model performance. Categories on which most models struggled were labeled difficult, whereas categories answered consistently correctly across model families were considered easy. This lets difficulty emerge from observed behavior rather than from subjective annotation alone.
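A minimal sketch of that empirical labeling; the accuracy cutoffs below are illustrative assumptions, since the post does not state the exact thresholds used:

```python
def label_difficulty(per_model_accuracy, easy_cutoff=0.75, hard_cutoff=0.50):
    """Label a category from the mean accuracy a set of models achieved
    on it: high mean -> Easy, low mean -> Difficult, else Medium.
    The cutoff values are hypothetical."""
    mean_acc = sum(per_model_accuracy) / len(per_model_accuracy)
    if mean_acc >= easy_cutoff:
        return "Easy"
    if mean_acc < hard_cutoff:
        return "Difficult"
    return "Medium"
```

For example, a category on which three models score 0.35, 0.42, and 0.48 would be labeled Difficult under these cutoffs.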

Alyah (Emirati dialect) evaluation results

We evaluate a wide range of modern Arabic and multilingual large language models on Alyah, using multiple-choice accuracy as the key metric. The evaluation covers a total of 53 models, including 22 base models and 31 instruction-tuned models, spanning native Arabic, multilingual, and regionally adapted systems. The radar plot below shows the performance of the top models of various sizes for each question category.

[Figure: radar plot of top-model performance per question category]

These results are intended as a reference measurement within Alyah’s scope, rather than an absolute ranking across all Arabic benchmarks.

Base models

Model                                Accuracy
google/gemma-3-27b-pt                74.68
tiiuae/Falcon-H1-34B-Base            73.66
FreedomIntelligence/AceGPT-v2-32B    67.35
google/gemma-3-4b-pt                 63.17
QCRI/Fanar-1-9B                      62.75
tiiuae/Falcon-H1-7B-Base             60.78
meta-llama/Llama-3.1-8B              58.23
Qwen/Qwen3-14B-Base                  57.29
inceptionai/jais-adapted-13b         56.01
Qwen/Qwen2.5-32B                     53.03
FreedomIntelligence/AceGPT-13B       50.81
Qwen/Qwen2.5-72B                     47.91
Qwen/Qwen2.5-14B                     46.80
google/gemma-2-2b                    41.86
tiiuae/Falcon3-7B-Base               41.43
Qwen/Qwen3-8B-Base                   40.75
tiiuae/Falcon-H1-3B-Base             40.41
Qwen/Qwen2.5-7B                      36.57
Qwen/Qwen2.5-3B                      35.29
meta-llama/Llama-3.2-3B              35.12
inceptionai/jais-adapted-7b          33.50
Qwen/Qwen3-4B-Base                   27.45

Instruction-tuned models

Model                                          Accuracy
falcon-h1-arabic-7b-instruct                   82.18
humain-ai/ALLaM-7B-Instruct-preview            77.24
google/gemma-3-27b-it                          74.68
Qwen/Qwen2.5-72B-Instruct                      74.60
falcon-h1-arabic-3b-instruct                   74.51
CohereForAI/aya-expanse-32b                    73.66
Navid-AI/Yehia-7B-preview                      73.32
FreedomIntelligence/AceGPT-v2-32B-Chat         72.80
Qwen/Qwen2.5-32B-Instruct                      71.61
tiiuae/Falcon-H1-34B-Instruct                  71.10
meta-llama/Llama-3.3-70B-Instruct              69.74
QCRI/Fanar-1-9B-Instruct                       69.22
tiiuae/Falcon-H1-7B-Instruct                   65.13
CohereForAI/c4ai-command-r7b-arabic-02-2025    64.54
silma-ai/SILMA-9B-Instruct-v1.0                63.94
FreedomIntelligence/AceGPT-v2-8B-Chat          63.43
CohereLabs/aya-expanse-8b                      61.21
yasserrmd/kallamni-2.6b-v1                     61.13
yasserrmd/kallamni-4b-v1                       60.70
microsoft/Phi-4-mini-instruct                  58.57
tiiuae/Falcon-H1-3B-Instruct                   57.12
silma-ai/SILMA-Kashif-2B-Instruct-v1.0         48.51
meta-llama/Llama-3.1-8B-Instruct               46.29
google/gemma-3-4b-it                           46.12
Qwen/Qwen2.5-7B-Instruct                       45.44
meta-llama/Llama-3.2-3B-Instruct               39.64
yasserrmd/kallamni-1.2b-v1                     37.77
Qwen/Qwen3-4B                                  26.26
Qwen/Qwen3-14B                                 26.00
google/gemma-2-2b-it                           26.00
Qwen/Qwen3-8B                                  25.66

Analysis and observed trends

Figure 1: Model accuracy across categories based on size.

Figure 2: Model accuracy across language-based categories.

Several trends emerge from the evaluation. As Figures 1 and 2 show, instruction-tuned models typically outperform their base counterparts. This is especially true for questions involving conversational norms and culturally appropriate responses (the Etiquette and Values category), as well as for questions testing imagery and figurative meaning. The latter can be attributed to models' already-strong ability to understand MSA-style imagery and figurative language: they can recognize patterns of non-literal writing regardless of dialect. Across the board, as Figure 1 shows, the most difficult categories for models were consistently Languages and Dialects and Greetings and Everyday Expressions, regardless of model size. These results reflect the limited presence of the Emirati dialect in written media: it is primarily spoken and rarely written, which makes it novel to the evaluated models. Nevertheless, instruction-tuned models show a clear advantage in dialect understanding (and in the other assessment categories) over their base counterparts, especially among small and medium-sized models. This is particularly evident in the Poetry and Creative Expression category, where large instruction-tuned models perform slightly better than small ones.

Figure 3: Average accuracy of the evaluated models.

As shown in Figure 3, even powerful multilingual models exhibit a noticeable drop on the most difficult Alyah questions. This suggests that dialect-specific semantic knowledge cannot be easily acquired through general multilingual training alone. Native Arabic models tend to perform more robustly on culturally grounded content, but their performance is not uniform across categories (Figure 2). In particular, questions involving implicit meanings or rare expressions remain difficult for almost all evaluated models, highlighting the persistent gap between surface-level dialect familiarity and deeper cultural understanding. The wide variation across categories also shows that models that excel at imagery and figurative meaning may still struggle with poetry and heritage-related creative questions: dialect proficiency is multidimensional and cannot be captured in a single score. Figure 3 shows that the highest-scoring large model is Jais-2-70B, followed by two smaller models, jais-2-8B and ALLaM-7B-Instruct; all three are Arabic instruction-tuned models.

Conclusions and community impact

This benchmark represents a step towards a more realistic and culturally grounded evaluation of Arabic models. By focusing on the Emirati dialect, we aim to support the development of models that better serve communities, institutions, and users in the UAE. Beyond model ranking, the benchmark serves as a diagnostic tool to guide future data collection, training, and adaptation efforts.

We invite researchers, practitioners, and the broader community to use the benchmark, explore the results, and share their feedback. Input from the community is essential to improving our dataset, expanding our coverage, and ensuring that Arabic dialects receive the attention they deserve in the evaluation of large-scale language models.

Citation

@misc{emirati_dialect_benchmark_2026,
  title  = {Alyah: Emirati dialect benchmark for evaluating large language models of Arabic},
  author = {Omar Alkaabi and Ahmed Alzubaidi and Hamza Alobeidli and Shaikha Alsuwaidi and Mohammed Alyafeai and Leen AlQadi and Basma El Amel Boussaha and Hakim Hacid},
  year   = {2026},
  month  = {January},
}
