This project addresses a key need in the advancement of Hebrew NLP. Because Hebrew is considered a low-resource language, existing LLM leaderboards rarely include a benchmark that accurately reflects its unique characteristics. Today we are excited to introduce a pioneering effort to change this: a new open LLM leaderboard designed specifically to evaluate and enhance Hebrew language models.
Hebrew is a morphologically rich language with a complex system of roots and patterns. Words are built from roots, with prefixes, suffixes, and infixes used to modify meaning, tense, or number, among other features. This complexity means that many valid word forms can be derived from a single root, complicating traditional tokenization strategies designed for morphologically simpler languages. As a result, existing language models can struggle to handle and understand the nuances of Hebrew accurately, highlighting the need for benchmarks that cater to these unique linguistic properties.
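To make the tokenization issue concrete, here is a minimal, illustrative sketch (not part of the leaderboard code) that runs several word forms sharing the Hebrew root כ-ת-ב ("write") through a subword tokenizer; the multilingual checkpoint chosen below is an arbitrary assumption for demonstration purposes.

```python
# Illustrative sketch: how a generic multilingual subword tokenizer fragments
# Hebrew word forms that all derive from the root כ-ת-ב ("write").
# The checkpoint below is an arbitrary choice for demonstration, not the
# tokenizer of any particular leaderboard model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Different surface forms derived from the same root:
# כתב (he wrote), כותב (writes / a writer), מכתב (a letter), כתבה (a news article)
forms = ["כתב", "כותב", "מכתב", "כתבה"]

for word in forms:
    print(word, "->", tokenizer.tokenize(word))
```

Each form is split into subword pieces differently, so the shared root is not represented consistently, which is one reason tokenizers built for morphologically simpler languages handle Hebrew poorly.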
LLM research in Hebrew therefore requires a dedicated benchmark that addresses the language's nuances and linguistic characteristics. Our leaderboard aims to fill this gap by providing robust evaluation metrics on language-specific tasks and promoting open, community-driven improvement of Hebrew generative language models. We believe this initiative will become a platform for researchers and developers to share, compare, and improve Hebrew LLMs.
Leaderboard Metrics and Tasks
We developed four key datasets designed to test models on their understanding and generation of Hebrew, irrespective of their performance in other languages. These benchmarks use a few-shot prompt format to evaluate the models, expecting them to adapt and respond correctly even with limited context.
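To illustrate what a few-shot prompt of this kind might look like, here is a schematic sketch; the field names, layout, and placeholder examples are hypothetical and do not reproduce the leaderboard's actual prompt templates.

```python
# Hypothetical sketch of a few-shot prompt: a handful of solved examples is
# concatenated ahead of the target question, and the model is expected to
# continue the pattern. This is NOT the leaderboard's actual prompt format.
few_shot_examples = [
    {"context": "...", "question": "...", "answer": "..."},  # placeholder solved example
    {"context": "...", "question": "...", "answer": "..."},  # placeholder solved example
]

def build_prompt(examples, context, question):
    """Concatenate the solved examples, then append the unanswered target question."""
    parts = [
        f"Context: {ex['context']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in examples
    ]
    parts.append(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_prompt(few_shot_examples, context="...", question="...")
```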
Below is a summary of each benchmark included in the leaderboard. Visit the Leaderboard tab for a more comprehensive breakdown of each dataset, its scoring system, and prompt construction.
Hebrew Question Answering: This task evaluates a model's ability to understand and process information presented in Hebrew, focusing on comprehension and the accurate retrieval of answers based on context. It checks the model's grasp of Hebrew syntax and semantics through direct question-and-answer formats.
Source: Test subset of the HeQ dataset.
Sentiment Accuracy: This benchmark tests a model's ability to detect and interpret sentiment in Hebrew text. It evaluates the model's capability to accurately classify statements as positive, negative, or neutral based on linguistic cues.
Winograd Schema Challenge: This task measures a model's understanding of pronoun resolution and contextual ambiguity in Hebrew. It tests the model's ability to correctly resolve pronouns in complex sentences using logical reasoning and general world knowledge.
Translation: This task assesses a model's proficiency in translating between English and Hebrew. It evaluates linguistic accuracy, fluency, and the ability to preserve meaning across languages, emphasizing the model's capability in bilingual translation tasks.
Technical setup
The leaderboard is inspired by the Open LLM Leaderboard and uses the Demo Leaderboard template. Submitted models are deployed automatically using Hugging Face's Inference Endpoints and evaluated through API requests managed by the lighteval library. The implementation was straightforward, with the main task being setting up the environment; the rest of the code ran smoothly.
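As a rough sketch of how a deployed model can be queried over HTTP in this kind of setup, the snippet below uses the huggingface_hub client against an Inference Endpoint URL; the URL, prompt, and generation parameters are placeholders, and this is not the leaderboard's actual lighteval-driven evaluation code.

```python
# Rough sketch: querying a model deployed on an Inference Endpoint via API
# requests. The endpoint URL is a placeholder; the real evaluation is managed
# by the lighteval library rather than this snippet.
from huggingface_hub import InferenceClient

client = InferenceClient(model="https://<your-endpoint>.endpoints.huggingface.cloud")

# A toy Hebrew prompt: "Question: What is the capital city of France? Answer:"
completion = client.text_generation(
    "שאלה: מהי עיר הבירה של צרפת?\nתשובה:",
    max_new_tokens=32,
)
print(completion)
```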
Please engage with us
We invite researchers, developers, and enthusiasts to participate in this initiative. Whether you are interested in submitting a model for evaluation or joining the discussion on advancing Hebrew language technologies, your contributions are important. For guidelines on how to submit models for evaluation, visit the Submit page on the Leaderboard or join the Leaderboard's HF Space discussion page.
This new leaderboard is more than just a benchmark tool; it is a call for the Israeli tech community to recognize and address the gaps in Hebrew language technology research. By providing detailed, language-specific evaluations, it aims to catalyze the development of models that are not only linguistically diverse but also culturally accurate, paving the way for innovation that respects the richness of Hebrew. Join us on this exciting journey and help reshape the language modeling landscape!
Sponsorship
The leaderboard is sponsored by DDR&D IMOD / The Israeli National Program for NLP in Hebrew and Arabic, DICTA: The Israel Center for Text Analysis, and Webiks. We would like to extend our gratitude to Professor Reut Tsarfaty of Bar-Ilan University for her scientific consultation and guidance.