🇨🇿 BenCzechMark is the first and most comprehensive evaluation suite for assessing the abilities of Large Language Models (LLMs) in Czech. It tests whether LLMs can:

- reason and perform complex tasks in Czech,
- generate and verify grammatically and semantically correct Czech,
- extract information and demonstrate stored knowledge by answering questions about Czech culture and Czech-related facts,
- perform well as language models, originally trained to estimate the probability of Czech text.
To achieve this, we sourced 50 tasks across nine categories. 90% of tasks contain native, untranslated content.
This blog post introduces both the evaluation suite itself and the BenCzechMark leaderboard, which features more than 25 open-source models of various sizes.
📋 Tasks and Categories
🇨🇿 BenCzechMark (in its current version) is divided into nine categories to comprehensively evaluate LLM proficiency. For each task, we:

- manually design at least five prompts, and record the best performance and the variance across prompts;
- distinguish between four task types, each associated with a metric:
  - Accuracy (Acc) measures multiple-choice (MC) tasks.
  - Exact match (EM) measures tasks with free generation of short answers.
  - Area under the receiver operating characteristic curve (AUROC, computed in the multiclass setting as the average over all class pairs) measures classification tasks without the need for threshold calibration. Out-of-the-box language models are often biased by the class distribution in the training data, by how prompts are structured, and by the examples provided during inference. These biases vary between models, leading to inconsistent predictions across models and prompts. Reliable decision-making on datasets with different class distributions would therefore require calibrating the model's predictions. Thresholdless metrics such as AUROC sidestep calibration altogether, since they rank predictions instead of thresholding them, which allows for fairer model comparisons (see e.g. Zhao et al., 2021 for details on calibrating LLMs).
  - Word-level perplexity (Ppl) is associated with language modeling tasks. It quantifies how likely the model is to generate the text, normalized by the number of words in the corpus.
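To make this concrete, here is a minimal, self-contained sketch (with made-up labels and scores) of both thresholdless AUROC scoring and word-level perplexity; scikit-learn's `roc_auc_score` supports the multiclass one-vs-one averaging directly:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Thresholdless classification metric: macro-averaged one-vs-one AUROC only
# needs per-class scores, so no decision threshold (and no calibration) is required.
y_true = np.array([0, 2, 1, 0, 2, 1, 0])   # made-up gold labels (3 classes)
y_score = np.array([                        # made-up per-class probabilities
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.1, 0.8],
    [0.3, 0.5, 0.2],
    [0.6, 0.2, 0.2],
])
auroc = roc_auc_score(y_true, y_score, multi_class="ovo", average="macro")
print(f"macro one-vs-one AUROC: {auroc:.3f}")

# Word-level perplexity: exponentiated negative log-likelihood of the text,
# normalized by the number of *words* rather than tokens, which keeps models
# with different tokenizers comparable.
token_logprobs = np.array([-2.1, -0.3, -1.7, -0.9, -2.4])  # made-up token log-probs
num_words = 3                                              # words in the scored text
ppl = np.exp(-token_logprobs.sum() / num_words)
print(f"word-level perplexity: {ppl:.2f}")
```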
The translated portion of the suite (10% of the tasks) was mostly translated with CUBBITT via the LINDAT Translation service, except for CsFever, whose authors used DeepL.
Here is the complete list of categories, together with the datasets and metrics used.
Reading comprehension tests whether the system can extract an answer to a question from information provided in context.

- Belebele – Acc – contains questions about manually translated web articles.
- SQAD3.2 – EM – a well-established reading comprehension task in SQuAD format, sourced from Wikipedia.

Factual knowledge contains questions that test the factual knowledge stored in the model.

- Umimeto (5 tasks covering biology/chemistry/history/informatics/physics) – Acc – problems for elementary and high school students in each topic. Source: umimeto.org
- TriviaQA – EM (translated using CUBBITT) – contains Q/A pairs from the Trivia and Quiz League website (US-centric dataset).
- NaturalQuestions – EM (translated using CUBBITT) – contains Q/A pairs from Google Search (US-centric dataset). We include these to verify that the model does not forget EN-centric knowledge when prompted in Czech (i.e., after a possible domain transfer).

Czech language comprehension targets understanding of the syntactic structure and subtle meanings of the Czech language.

- CERMAT (Open/TF/MC) – EM/AUROC/Acc – open-answer, true/false, and multiple-choice questions from 6th- and 9th-grade tests and state high school exams.
- Grammar Error Detection – AUROC – a binary task to predict whether a sentence is grammatically correct or incorrect; contains sentences from language learner essays.
- Agree – Acc – requires filling in the missing grammatical suffixes of past-tense verbs.

Language modeling tests how likely the model is to sample a particular piece of Czech text.

- Czech National Corpus – Ppl – seven tasks spanning spoken, dialectal, historical, and other varieties of Czech, provided by ČNK.
- HellaSwag – Acc (translated using CUBBITT) – requires choosing the most plausible continuation of a text from four options.

Czech math reasoning quantifies how well the model understands and solves Czech mathematics tasks.

- Klokan QA – Acc – questions from Czech elementary/high school mathematics competitions.
- CERMAT (Math) – EM/Acc – the mathematics subsection of CERMAT Open/MC.
- Umimeto (Math) – Acc – the mathematics subsection of Umimeto.

Natural language inference tests whether, given a pair of texts, one text carries the information asked about in the other.

- Czech SNLI – AUROC (SNLI translated using CUBBITT + manual corrections) – tests whether a hypothesis is entailed by the premise text.
- CsFever – AUROC (Czech version of the FEVER dataset, created via partial translation) – asks whether a claim is (at least partially) supported by the evidence.
- CTKFacts – AUROC – same format as CsFever, but manually sourced from Czech News Agency articles.
- Propaganda – AUROC – 13 tasks that predict various aspects of news articles, such as location, genre, and emotional theme.

Named entity recognition determines whether the model recognizes different types of named entities in text.

- CNEC2.0 – EM – a standard Czech NER dataset.
- Czech Court Decisions – EM – NER derived from decisions of the Czech Supreme/Constitutional Courts.

Sentiment analysis quantifies how well the model estimates sentiment in text.

- Subjectivity – AUROC – asks whether a text is subjective or objective.
- CzechSentiment (MALL/CSFD/FB) – AUROC – sentiment analysis of product reviews, movie reviews, and Facebook comments.

Document retrieval focuses on identifying relevant documents.

- Historical IR – Acc – a multiple-choice task of selecting sentences relevant/irrelevant to a query.
⚔️ Model duel and average score
Since different tasks use different metrics with different scales, we cannot simply average them. Instead, we introduce a novel method to determine the final score: we let the models duel.
For every task and metric, we compute a test of statistical significance at α = 0.05; that is, model A wins a duel against model B if the estimated probability that their performances are equal is below 0.05. We use the following tests, each with a different statistical power:
- Acc and EM: one-tailed paired t-test,
- AUROC: the test of Goutte et al., 2005,
- Ppl: a bootstrap-inspired Bayesian test.
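For Acc and EM, for instance, a duel boils down to a one-tailed paired t-test over per-example scores. Here is a minimal sketch with made-up 0/1 exact-match outcomes (the data is illustrative):

```python
from scipy.stats import ttest_rel

# Made-up per-example exact-match outcomes of two models on the same task instances.
model_a = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
model_b = [0, 0, 1, 0, 0, 1, 1, 0, 0, 1]

# One-tailed paired t-test: does model A significantly outperform model B?
stat, p_value = ttest_rel(model_a, model_b, alternative="greater")
alpha = 0.05
verdict = "wins" if p_value < alpha else "does not win"
print(f"p = {p_value:.4f} -> model A {verdict} the duel")
```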
Next, for every task, we compute each model's duel win score (DWS): the proportion of duels won against all other models on that task. Finally, we compute the aggregate scores as follows:
- Category DWS: the average of task DWSs within a category,
- Average DWS: the average across category DWSs.
This yields an easy-to-interpret model score: a macro-averaged model win rate.
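A minimal sketch of this aggregation, using a random, made-up tensor of duel outcomes (the model names and category grouping are purely illustrative):

```python
import numpy as np

# wins[m, n, t] == 1 iff model m won the significance duel against model n on task t.
# (Random data only illustrates the shapes; real entries come from the tests above.)
models = ["model_A", "model_B", "model_C"]
task_category = ["reading", "reading", "math"]  # category of each of the 3 tasks
rng = np.random.default_rng(0)
wins = rng.integers(0, 2, size=(3, 3, 3))
idx = np.arange(len(models))
wins[idx, idx, :] = 0  # a model never duels itself

# Task-level DWS: fraction of duels won against all other models on that task.
task_dws = wins.sum(axis=1) / (len(models) - 1)      # shape: (models, tasks)

# Category DWS: mean task DWS within each category; final score: mean over categories.
categories = sorted(set(task_category))
cat_dws = np.stack(
    [task_dws[:, np.array(task_category) == cat].mean(axis=1) for cat in categories],
    axis=1,
)
avg_dws = cat_dws.mean(axis=1)                       # macro-averaged win rate
for name, score in zip(models, avg_dws):
    print(f"{name}: average DWS = {score:.3f}")
```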
👑 BenCzechMark Leaderboard – Llama-405B takes the crown
To identify the best-performing open-source model in the suite, we evaluated 26 open-weight models using the following parameters:
- Maximum input length: 2048 tokens
- Few-shot examples: 3
- Truncation: smart truncation (few-shot samples are truncated first, then the task description; see the sketch below)
- Log-probability aggregation: average pooling (avoids a bias towards long documents)
- Chat templates: not used
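The suite's exact truncation code is not reproduced here, but a minimal sketch of the policy described above could look like this (the helper, the tokenizer interface, and the trimming step size are assumptions for illustration, not the actual implementation):

```python
def build_prompt(description, shots, question, tokenizer, max_len=2048):
    """Illustrative 'smart truncation': drop few-shot examples first, then trim
    the task description, so the question itself is always preserved."""
    def n_tokens(parts):
        return len(tokenizer.encode("\n\n".join(p for p in parts if p)))

    shots = list(shots)
    while shots and n_tokens([description, *shots, question]) > max_len:
        shots.pop()                          # 1) discard few-shot examples one by one
    while description and n_tokens([description, *shots, question]) > max_len:
        description = description[:-50]      # 2) then trim the task description
    return "\n\n".join(p for p in [description, *shots, question] if p)
```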
The results can be examined in our space. Llama-405B emerged as the overall winner, but it did not dominate every category; interestingly, some models shine in specific areas. For example:
- Qwen-72B excels in math and information retrieval, but lags behind similarly sized models in other categories.
- Aya-23-35B excels in sentiment analysis and language modeling, but likewise lags in other categories.
- Gemma-2 9B delivers excellent results in Czech reading comprehension, outperforming much larger models.
🇨🇿 Do you think your model is good at Czech? Submit it!
One of the main goals of BenCzechMark is to let researchers measure how well their models perform in Czech, and to spur the community to train and discover models that are better at Czech.
If you know a model that stands out, submit it to the leaderboard and make the competition even more exciting!
We've put together a simple three-step guide to get you started quickly. You can find it in the Submission tab of the BenCzechMark space.
🌟 Acknowledgments
We would like to thank all the contributors from BUT FIT, FI MUNI, CIIRC CTU and Hugging Face for their valuable contributions to the realization of BenCzechMark.
We would also like to thank the organizations that provided source data for some tasks, namely Umímeto, CERMAT, and ČNK.
📚 Citations and references
```bibtex
@article{fajcik2024benczechmark,
  title       = {{B}en{C}zech{M}ark: Czech-centric Multitask and Multimetric Benchmarking of Language Models with Duel Scoring Mechanism},
  author      = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
  year        = {2024},
  url         = {https://huggingface.co/spaces/CZLC/BenCzechMark},
  institution = {Brno University of Technology, Masaryk University, Czech Technical University in Prague, Hugging Face},
}
```