Leading the Korean LLM evaluation ecosystem

By versatileai | July 8, 2025

In the rapidly evolving landscape of large language models (LLMs), building an “ecosystem” has become more important than ever. This trend is evident in several major developments, such as Hugging Face’s efforts to democratize NLP and Upstage’s work in building a generative AI ecosystem.

Inspired by these industry milestones, we launched the Open Ko-LLM Leaderboard at Upstage in September 2023. Our goal was to quickly develop and introduce an evaluation ecosystem for Korean LLM data, aligning with the global movement toward open and collaborative AI development.

Our vision for the Open Ko-LLM Leaderboard is to nurture a vibrant Korean LLM evaluation ecosystem, fostering transparency by enabling researchers to share their results and uncovering hidden talent in the LLM field. In essence, we strive to expand the playing field for Korean LLMs. To that end, we developed an open platform where anyone can register a Korean LLM and compete against other models. Furthermore, we aimed to create a leaderboard that captures the unique characteristics and culture of the Korean language. To achieve this goal, we made sure that translated benchmark datasets such as Ko-MMLU reflect the distinctive attributes of Korean.

Leaderboard Design Choice: Creating a New Private Test Set for Fairness

The Open Ko-LLM Leaderboard is characterized by two distinctive approaches to benchmarking:

  • Adoption of Korean-language datasets, in contrast to the prevalent use of English-based benchmarks.
  • Private test sets, in contrast to the open test sets of most leaderboards: we decided to construct entirely new datasets dedicated to the Open Ko-LLM Leaderboard and keep them private, preventing contamination of the test sets and ensuring a more equitable comparison framework.

While we acknowledge the broader impact and practical value that open benchmarks offer the research community, the decision to maintain a closed test set environment was made with the aim of fostering a more controlled and fair comparative analysis. A sketch of the kind of contamination audit this design sidesteps appears below.
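To make the contamination concern concrete, here is a minimal, hypothetical sketch of a word-level n-gram overlap check between benchmark items and a training corpus, similar in spirit to the overlap analyses often run for public test sets. The function names and the 8-gram window are illustrative assumptions, not part of the leaderboard’s actual tooling; keeping the test set private avoids the need for this kind of after-the-fact auditing.

```python
# Hypothetical n-gram overlap check: flags a test item whose word n-grams
# also appear verbatim in some training document. Illustrative only.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item: str, training_corpus: list[str], n: int = 8) -> bool:
    """True if any n-gram of the test item occurs in any training document."""
    test_grams = ngrams(test_item, n)
    return any(test_grams & ngrams(doc, n) for doc in training_corpus)

# A benchmark question copied verbatim into training data gets flagged;
# a genuinely unseen (private) question does not.
corpus = ["... the boiling point of water at sea level is 100 degrees celsius ..."]
print(is_contaminated("what is the boiling point of water at sea level", corpus))  # True
```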

Evaluation Tasks

The Open Ko-LLM Leaderboard employs the following five evaluation tasks:

  • Ko-ARC (AI2 Reasoning Challenge): Ko-ARC is a multiple-choice test designed to assess scientific thinking and understanding. It measures the reasoning required to solve scientific problems, evaluating complex reasoning, problem-solving skills, and comprehension of scientific knowledge. The evaluation metric is accuracy, reflecting how often the model selects the correct answer from a set of options and thereby measuring its ability to navigate and apply scientific principles effectively.
  • Ko-HellaSwag: Ko-HellaSwag evaluates situational comprehension and prediction ability, either in a generative format or as a multiple-choice setup. It tests the capacity to predict the most plausible next scenario given a situation, serving as an indicator of the model’s ability to understand and reason about everyday situations. Metrics include accuracy, assessing the quality of predictions when the task is approached as multiple choice.
  • Ko-MMLU (Massive Multitask Language Understanding): Ko-MMLU assesses language understanding across a wide range of subjects and fields in a multiple-choice format. This breadth shows how well a model performs across diverse domains, testing both the versatility and the depth of its language understanding. Overall accuracy and per-task performance are the key metrics, highlighting strengths and weaknesses across different areas of knowledge.
  • Ko-TruthfulQA: Ko-TruthfulQA is a multiple-choice benchmark designed to assess a model’s truthfulness and factual accuracy. Unlike a generative format in which the model freely produces responses, in this multiple-choice setting the model is tasked with selecting the most accurate and truthful answer from a set of options. This approach emphasizes the model’s ability to discern truthfulness within a constrained choice framework. The key metric is the accuracy of the model’s selections, assessing their consistency with known facts and its ability to identify the most truthful response among the options offered.
  • Ko-CommonGen V2: A benchmark newly created for the Open Ko-LLM Leaderboard, Ko-CommonGen V2 evaluates whether LLMs can generate outputs that align with Korean common sense under given conditions, testing the model’s ability to produce contextually and culturally appropriate output in Korean.

A minimal sketch of the multiple-choice scoring scheme shared by most of these tasks follows the list.
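As context for the accuracy metrics above, here is a minimal sketch of how multiple-choice tasks are commonly scored for causal LMs: each answer option is scored by the log-likelihood the model assigns to it, and an item counts as correct when the gold option scores highest. This assumes a Hugging Face causal LM and is only illustrative; the leaderboard’s actual evaluation harness may differ in detail (for example, in length normalization or prompt formatting).

```python
# Minimal multiple-choice scoring sketch: rank options by log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM on the Hub works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_loglikelihood(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to the option's tokens."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits              # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    start = prompt_ids.shape[1]                      # first option-token position
    targets = full_ids[:, start:]                    # option tokens to score
    # Token at position j is predicted by the logits at position j - 1.
    # (This sketch ignores subtle tokenization effects at the prompt boundary.)
    picked = log_probs[0, start - 1:].gather(1, targets.T)
    return picked.sum().item()

def multiple_choice_accuracy(items: list[dict]) -> float:
    """items: [{'question': str, 'options': list[str], 'answer': int}, ...]"""
    hits = 0
    for item in items:
        scores = [option_loglikelihood(item["question"], o) for o in item["options"]]
        hits += int(max(range(len(scores)), key=scores.__getitem__) == item["answer"])
    return hits / len(items)
```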

A Leaderboard in Action: A Barometer for Ko-LLM

The Open Ko-LLM Leaderboard has exceeded expectations, with over 1,000 models submitted. By comparison, the original English Open LLM Leaderboard hosts over 4,000 models; the Ko-LLM leaderboard reached a quarter of that number just five months after its launch. We are grateful for this extensive participation, which reflects a vibrant interest in Korean LLM development.

Of particular note is the diversity of the competition, which spans individual researchers as well as companies and institutions such as KT, Lotte Information & Communication, Yanolja, MegaStudy, Maum AI, 42MARU, the Electronics and Telecommunications Research Institute (ETRI), KAIST, and Korea University. One standout submission is KT’s Mi:dm 7B model, which not only tops the rankings among models with 7B parameters or fewer, but is also openly available for public use, marking an important milestone.

More generally, we have observed that two types of models exhibit strong performance on the leaderboard: models that underwent cross-lingual transfer or fine-tuning in Korean (such as Upstage’s SOLAR), and models fine-tuned from Llama 2, Yi, and Mistral, highlighting the importance of building on solid foundation models for fine-tuning.

Managing such a large leaderboard has not come without its own challenges. The Open Ko-LLM Leaderboard aims to align closely with the philosophy of the Open LLM Leaderboard, particularly in its integration with the Hugging Face model ecosystem. This strategy makes the leaderboard easy for participants to access and take part in, and it is an important element of its operation. Nevertheless, there are limitations imposed by the infrastructure, which relies on 16 A100 80GB GPUs. This setup struggles with models exceeding 30 billion parameters, as they demand a great deal of compute, leaving many submissions pending for long periods. Addressing these infrastructure challenges is essential to the future growth of the Open Ko-LLM Leaderboard. A rough estimate of why models at that scale are the pain point appears below.
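As a back-of-the-envelope illustration (our own assumption about where the bottleneck lies, not an official breakdown), the fp16 weights of a 30B-parameter model alone consume most of a single A100’s 80 GB, before activations, the KV cache, or framework overhead are counted:

```python
# Rough memory arithmetic for evaluating a 30B-parameter model in fp16.
# Figures are illustrative; real runs also need memory for activations,
# the KV cache, and framework overhead.
params = 30e9            # 30 billion parameters
bytes_per_param = 2      # fp16/bf16 stores 2 bytes per parameter
weights_gib = params * bytes_per_param / 1024**3
print(f"fp16 weights alone: {weights_gib:.0f} GiB of an A100's 80 GiB")  # ~56 GiB
```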

Our Vision and Next Steps

We recognize several limitations of the current leaderboard model when it is considered in real-world contexts:

  • Outdated data: Datasets such as SQuAD and KLUE become outdated over time. Data evolves and transforms continuously, yet existing leaderboards remain fixed to a specific point in time; with hundreds of new data points generated daily, they fail to reflect the present moment.
  • Failure to reflect the real world: In B2B and B2C services, data is constantly accumulated from users and industries, and edge cases and outliers arise continuously. True competitive advantage lies in handling these challenges well, yet current leaderboard systems lack any means of measuring this ability. Real-world data is perpetually being generated, altered, and transformed.
  • The questionable meaning of competition: Many models are tuned specifically to perform well on the test sets, which can lead to another form of overfitting within the test set. Current leaderboard systems therefore operate in a leaderboard-centric manner rather than a real-world-centric one.

We therefore plan to further develop the leaderboard so that it addresses these issues and becomes a trusted resource widely recognized by many. By incorporating a variety of benchmarks that correlate strongly with real-world use cases, we aim to make the leaderboard not only more relevant but genuinely useful to industry. We aspire to bridge the gap between academic research and practical application, and we will continuously update and enhance the leaderboard, drawing on feedback from both the research community and industry practitioners to ensure that its benchmarks remain rigorous, comprehensive, and up to date. Through these efforts, we hope to contribute to the advancement of the field by providing a platform that accurately measures, and thereby drives, progress in large language models toward solving practical and impactful problems.

If you would like to develop a dataset and work with us on this, we would be happy to talk. Please contact chanjun.park@upstage.ai or contact@upstage.ai!

As a side note, I believe that evaluation in a live online environment, as opposed to benchmark-based evaluation, is especially meaningful. Even within benchmark-based evaluation, benchmarks should be updated monthly, and they should assess domain-specific capabilities in finer detail. We want to encourage initiatives along these lines.

Thanks to Our Partners

The Open Ko-LLM Leaderboard journey began with a collaboration agreement to develop a Korean-style leaderboard between Upstage and the National Information Society Agency (NIA), one of South Korea’s leading national agencies. This partnership was the starting signal, and the leaderboard launched within just one month. To test common-sense reasoning, we collaborated with the research team of Professor Heuiseok Lim at Korea University to incorporate Ko-CommonGen V2 as an additional task on the leaderboard.

Building a robust infrastructure was essential to this success. To that end, we are grateful to Korea Telecom (KT) for its generous provision of GPU resources and its continued support. The Open Ko-LLM Leaderboard has also been honored to establish direct communication with Hugging Face, a global leader in natural language processing, and is engaged in ongoing discussions to advance new initiatives.

In addition, the Open Ko-LLM Leaderboard boasts a respected consortium of trusted partners: the National Information Society Agency (NIA), Upstage, KT, and Korea University. The participation of these institutions, in particular the inclusion of national agencies, underscores the leaderboard’s potential as a cornerstone of the academic and practical exploration of language models, and lends significant authority and credibility to the effort.
