The Open Arabic LLM Leaderboard (OALL) is designed to address the growing need for specialized benchmarks in the Arabic language processing domain. As the field of natural language processing (NLP) progresses, the focus often skews heavily toward English, leaving a large resource gap for other languages. OALL aims to redress this balance by providing a platform specifically for assessing and comparing the performance of Arabic Large Language Models (LLMs), thereby promoting research and development in Arabic NLP.
This initiative is especially important given that it directly serves more than 380 million Arabic speakers worldwide. We hope that, by improving the ability to accurately assess and refine Arabic LLMs, the OALL will play a key role in developing models and applications finely tuned to the nuances of the Arabic language, culture, and heritage.
Benchmarks, Metrics, and Technical Setup

Benchmark Datasets
The Open Arabic LLM Leaderboard (OALL) utilizes a wide range of robust datasets to ensure comprehensive model evaluation.
AlGhafa Benchmark: created by the TII LLM team, it is designed to evaluate models on a range of abilities including reading comprehension, sentiment analysis, and question answering. It was originally introduced with 11 native Arabic datasets and later extended with 11 additional datasets that are translations of other widely adopted benchmarks from the English NLP community.

ACVA and AceGPT Benchmarks: the paper "AceGPT, Localizing Large Language Models in Arabic" contributes 58 datasets, as well as Arabic translated versions of the MMLU and EXAMS benchmarks, broadening the evaluation spectrum. These benchmarks are meticulously curated and feature a variety of subsets that capture the complexity and subtlety of the Arabic language.
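For readers who want to inspect the evaluation data directly, the subsets can be browsed with the Hugging Face datasets library. The dataset id below is a placeholder, not a confirmed repository name; check the OALL organization on the Hub for the actual benchmark repositories.

from datasets import load_dataset

# Load one subset of an Arabic benchmark from the Hugging Face Hub.
# Replace the placeholder id with the actual benchmark repository.
ds = load_dataset("your-benchmark-dataset-id", split="test")

# Each multiple-choice example typically pairs a question with candidate
# answers and a gold label (field names vary across subsets).
print(ds[0])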
Evaluation Metrics
Given the nature of the tasks, which include multiple-choice and yes/no questions, the leaderboard primarily uses normalized log-likelihood accuracy for all tasks. This metric was chosen for its ability to provide a clear and fair measure of model performance across the different question types.
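To make the metric concrete, here is a minimal sketch of length-normalized log-likelihood scoring for a single multiple-choice example. It is not the exact LightEval implementation; the prompt formatting and the normalization unit (characters here, rather than tokens) are assumptions for illustration.

import torch
import torch.nn.functional as F

def pick_choice(model, tokenizer, question, choices):
    # Score each candidate answer by the sum of log-probabilities of its
    # tokens given the question, normalized by answer length so longer
    # answers are not unfairly penalized.
    scores = []
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    p = prompt_ids.shape[1]
    for choice in choices:
        # Note: re-tokenizing the concatenation is a simplification;
        # harnesses usually align token boundaries more carefully.
        full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # Log-probability of every token given its prefix.
        logprobs = F.log_softmax(logits[0, :-1], dim=-1)
        answer_ids = full_ids[0, p:]
        answer_logprobs = logprobs[p - 1:].gather(1, answer_ids.unsqueeze(1))
        scores.append(answer_logprobs.sum().item() / max(len(choice), 1))
    return max(range(len(choices)), key=lambda i: scores[i])

The model's prediction is the highest-scoring choice, and accuracy is the fraction of examples where this prediction matches the gold answer.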
Technical Setup
Here is the technical setup for the Open Arabic LLM Leaderboard (OALL):
The front-end and back-end are inspired by the demo-leaderboard template; the back-end runs locally on a TII cluster and uses the LightEval library to run the evaluations. Significant work went into integrating the Arabic benchmarks described above into LightEval, so that the community can evaluate Arabic models out of the box (see GitHub PR #44 and PR #95 for more details).
Future Directions
There are many ideas for expanding the scope of the Open Arabic LLM Leaderboard. Plans include introducing additional leaderboards for different categories, such as one for assessing Arabic LLMs in Retrieval-Augmented Generation (RAG) scenarios, and a chatbot arena that computes ELO scores for different Arabic chatbots based on user preferences.
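For the chatbot arena idea, Elo-style ratings are typically updated from pairwise user preferences. The sketch below shows a generic Elo update; the actual scoring scheme for the planned OALL arena has not been specified, and the K-factor and starting ratings here are illustrative choices.

def elo_update(rating_a, rating_b, winner, k=32):
    # One Elo update from a single pairwise preference.
    # winner: "a", "b", or "tie".
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# Example: two chatbots start at 1000; a user prefers model A's answer.
print(elo_update(1000, 1000, "a"))  # -> (1016.0, 984.0)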
Additionally, we aim to extend the benchmarks to cover more tasks by developing the OpenDolphin benchmark, which will contain approximately 50 datasets, building on the work of Nagoudi et al. in the paper "Dolphin: A Challenging and Diverse Benchmark for Arabic NLG." If you are interested in adding benchmarks or collaborating on the OpenDolphin project, please reach out via the Discussions tab or this email address.
We welcome your contributions on all of these fronts! We encourage the community to contribute by submitting models, suggesting new benchmarks, and taking part in discussions. We also encourage the community to build on the top models of the current leaderboard, creating new models through fine-tuning or other techniques that may help a model climb the ranks. You could become the next Arabic open-model hero!
We hope that the OALL encourages technological advances that highlight the unique linguistic and cultural characteristics of Arabic, and that the technical setup and the lessons learned from developing a large-scale, language-specific leaderboard will help similar initiatives in other underrepresented languages. This focus will help bridge the resource and research gaps traditionally dominated by English-centric models, enriching the global NLP landscape with more diverse and inclusive tools. This is important as AI technology becomes increasingly integrated into everyday life around the world.
Submit Your Model!
Model Submission Process
To ensure a smooth assessment process, participants must adhere to certain guidelines when submitting their models to the Open Arabic LLM leaderboard.
Ensure model precision alignment: it is important that the precision of the submitted model matches that of the original model. Precision mismatches may cause the model to be evaluated but not displayed properly on the leaderboard.
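For example, a model uploaded in float16 should be evaluated in float16. A quick local check of the loading precision might look like the following sketch (the torch_dtype value is an example; use whatever precision your model was uploaded in):

import torch
from transformers import AutoModelForCausalLM

# Load the model in the same dtype it was uploaded in, so the evaluated
# precision matches the original.
model = AutoModelForCausalLM.from_pretrained(
    "your model name",
    torch_dtype=torch.float16,
)
print(model.dtype)  # verify the precision before submitting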
Checks before submission:
Model and tokenizer loading: make sure your model and tokenizer can be loaded properly using AutoClasses, with the following commands:
from transformers import AutoConfig, AutoModel, AutoTokenizer

config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
If this step fails, follow the error messages to debug your model before submitting it; it is likely that the model was uploaded incorrectly.
Model visibility: make sure your model is public. Additionally, note that if your model requires use_remote_code=True, this feature is not currently supported, but support is under development.
Convert your model weights to safetensors: this is a safer and faster format for loading and storing weights. It also allows the extended viewer to include the model's parameter count.
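A minimal way to re-save existing weights in the safetensors format with transformers is sketched below (model name and output path are placeholders):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your model name")

# Re-save the weights as safetensors; recent transformers versions do this
# by default, and safe_serialization=True makes it explicit.
model.save_pretrained("path/to/output", safe_serialization=True)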
License and model cards:
Open license: make sure your model is openly licensed. The leaderboard promotes open LLMs to ensure widespread accessibility.

Complete model card: fill in the details on your model card. This data is automatically extracted and displayed on the leaderboard alongside the model.
In the event of a model failure
If your model appears in the FAILED category, this indicates that execution stopped. Review the steps above to troubleshoot and resolve the issue, and test your model locally with a sanity check such as the one below before resubmitting.
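As an illustration, a minimal local sanity check might look like the following sketch, assuming a causal (generative) LM; if loading, tokenization, and generation all succeed locally, the most common causes of a failed run are ruled out.

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "your model name"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Run a tiny end-to-end generation on an Arabic prompt ("hello").
inputs = tokenizer("مرحبا", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))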
Acknowledgments
We appreciate all of our contributors, partners, and sponsors, especially the Technology Innovation Institute (TII) and Hugging Face for their substantial support of this project. TII generously provided essential computational resources, in line with its commitment to supporting community-driven projects and promoting open science in the Arabic NLP field, while Hugging Face helped integrate and customize its new evaluation framework and leaderboard template.
We would also like to thank the team behind the Open Ko-LLM Leaderboard; their pioneering contributions helped guide our approach to developing a comprehensive and inclusive Arabic LLM leaderboard.
Citations and References
@misc{OALL,
  author = {El Filali, Ali and Alobeidli, Hamza and Fourrier, Clémentine and Boussaha, Basma El Amel and Cojocaru, Ruxandra and Habib, Nathan and Hacid, Hakim},
  title = {Open Arabic LLM Leaderboard},
  year = {2024},
  publisher = {OALL},
  howpublished = "\url{https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard}"
}

@inproceedings{almazrouei-etal-2023-alghafa,
  title = "{A}l{G}hafa Evaluation Benchmark for {A}rabic Language Models",
  author = "Almazrouei, Ebtesam and Cojocaru, Ruxandra and Baldo, Michele and Malartic, Quentin and Alobeidli, Hamza and Mazzotta, Daniele and Penedo, Guilherme and Campesan, Giulia and Farooq, Mugariya and Alhammadi, Maitha and Launay, Julien and Noune, Badreddine",
  editor = "Sawaf, Hassan and El-Beltagy, Samhaa and Zaghouani, Wajdi and Magdy, Walid and Abdelali, Ahmed and Tomeh, Nadi and Abu Farha, Ibrahim and Habash, Nizar and Khalifa, Salam and Keleg, Amr and Haddad, Hatem and Zitouni, Imed and Mrini, Khalil and Almatham, Rawan",
  booktitle = "Proceedings of ArabicNLP 2023",
  month = dec,
  year = "2023",
  address = "Singapore (Hybrid)",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.arabicnlp-1.21",
  doi = "10.18653/v1/2023.arabicnlp-1.21",
  pages = "244--275"
}

@misc{huang2023acegpt,
  title = {AceGPT, Localizing Large Language Models in Arabic},
  author = {Huang Huang and Fei Yu and Jianqing Zhu and Xuening Sun and Hao Cheng and Dingjie Song and Zhihong Chen and Abdulmohsen Alharthi and Bang An and Ziche Liu and Zhiyi Zhang and Junying Chen and Jianquan Li and Benyou Wang and Lian Zhang and Ruoyu Sun and Xiang Wan and Haizhou Li and Jinchao Xu},
  year = {2023},
  eprint = {2309.12053},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

@misc{lighteval,
  author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas and Tunstall, Lewis},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2023},
  version = {0.3.0},
  url = {https://github.com/huggingface/lighteval}
}