LLMs are becoming increasingly proficient in English, but it is difficult to know how well they perform in other widely spoken languages, each of which brings its own linguistic challenges. Today, we are excited to help bridge this gap for Japanese.
We are pleased to announce the Open Japanese LLM Leaderboard, composed of more than 20 datasets spanning classic to modern NLP tasks, to shed light on the strengths and weaknesses of Japanese LLMs. The Open Japanese LLM Leaderboard was built by LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs), in partnership with Hugging Face.
The Japanese language presents unique challenges. Morphologically rich, and constantly evolving through historical and cultural interactions with the rest of the world, its writing system combines kanji (漢字), ideograms of Chinese origin, with two phonetic syllabaries: hiragana (平仮名 / ひらがな) and katakana (片仮名 / カタカナ), the latter often used for foreign words. Modern Japanese is arguably one of the most difficult languages to process: it mixes words of Chinese and native Japanese origin, Latin script (rōmaji, ローマ字), loanwords from Dutch, Portuguese, French, English, and German, Arabic numerals, and even traditional Chinese numerals. In addition, Japan's digital culture has given rise to kaomoji, emoticons written with Unicode characters 🙂, from the Cyrillic alphabet (っ°Д°;)っ to Greek letters _φ(°-°=). And of course, let's not forget the classic emoji, which originated in Japan with the rise in popularity of mobile phones in the 1990s.
Japanese's complex writing system hides a further layer of complexity: the absence of spaces between words. Like Chinese and Thai, Japanese does not use whitespace between linguistic units, which makes detecting word boundaries during tokenization particularly difficult. Over the years, the vibrant Japanese NLP ecosystem (from prestigious university labs and AI startups to industry-leading R&D centers) has accounted for these specificities of Japanese to develop robust, modern Japanese LLMs. What has been missing is common ground on which to compare these models.
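To see concretely why the lack of spaces matters, here is a minimal sketch: naive whitespace splitting, which recovers word tokens for English, returns the entire Japanese sentence as a single chunk, so tokenizers must infer boundaries from dictionaries or learned subword models instead. (The example sentence is our own, not from the leaderboard datasets.)

```python
# Naive whitespace splitting works for English but fails for Japanese,
# which writes sentences without spaces between words.
english = "the cat sat on the mat"
japanese = "猫がマットの上に座った"  # "the cat sat on the mat"

print(english.split())   # six word tokens
print(japanese.split())  # a single undivided chunk: no spaces to split on

assert len(english.split()) == 6
assert len(japanese.split()) == 1  # word boundaries must be inferred
```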
We are therefore introducing the Open Japanese LLM Leaderboard, a collaboration between Hugging Face and LLM-jp, to promote research transparency and encourage an open-source philosophy of model development. We strongly believe that this initiative will serve as a platform for researchers in Japan and abroad to collaborate, evaluate, and strengthen Japanese LLMs.
Leaderboard task overview
The Open Japanese LLM Leaderboard uses llm-jp-eval, a specialized evaluation suite, to assess Japanese LLMs on tasks ranging from classic ones (natural language inference, machine translation, summarization, question answering, etc.) to more modern ones (e.g., code generation, mathematical reasoning, exam-style questions). Tasks are run in a 4-shot setting.
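The exact prompt templates live inside the llm-jp-eval library; purely as an illustration of what a 4-shot setting means, the sketch below (our own hypothetical helper, not library code) prepends four solved examples to the query before the model is asked to complete the answer.

```python
# Hypothetical sketch of few-shot prompt construction.
# The real templates used by llm-jp-eval are defined in the library itself.
def build_few_shot_prompt(instruction, examples, query, n_shots=4):
    """Prepend n_shots solved (input, output) examples to the query."""
    parts = [instruction]
    for inp, out in examples[:n_shots]:
        parts.append(f"入力: {inp}\n出力: {out}")  # 入力 = input, 出力 = output
    parts.append(f"入力: {query}\n出力:")          # the model completes this line
    return "\n\n".join(parts)

examples = [("A", "1"), ("B", "2"), ("C", "3"), ("D", "4"), ("E", "5")]
prompt = build_few_shot_prompt("数字に変換してください。", examples, "F")
print(prompt)  # instruction, 4 solved examples, then the open query
```

Note that only the first four examples are used; a 4-shot run gives the model just enough demonstrations to infer the task format without fine-tuning.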
The datasets were compiled by LLM-jp's evaluation team, and were either built from scratch with linguists, domain experts, and human annotators, or automatically translated into Japanese and adapted to Japanese characteristics. Some of them require long contextual reasoning. To give a better sense of the leaderboard, we detail samples from eight datasets below (Japanese text followed by its English translation in light gray). For more information on all available tasks, please see the Overview tab of the leaderboard and the official links for each dataset.
Jamp
Jamp (Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capacity of Language Models) is a Japanese temporal-inference benchmark for NLI. The dataset provides pairs of sentences in English and Japanese covering various temporal inference patterns, annotated with gold labels such as entailment, neutral, and contradiction.
JEMHopQA
JEMHopQA (Japanese Explainable Multi-hop Question Answering) is a Japanese multi-hop QA dataset for evaluating internal reasoning. The task takes a question as input and produces both an answer and its derivation.
JCommonsenseQA
JCommonsenseQA is the Japanese version of CommonsenseQA, a multiple-choice question-answering dataset. Its purpose is to assess commonsense reasoning ability.
chABSA
chABSA is an aspect-based sentiment analysis dataset. It is based on the 2016 securities reports of Japanese listed companies and annotates entity–attribute–sentiment triples. More specifically, 230 of Japan's 2,260 listed companies (approximately 10% of all listed companies) were annotated, following the industry classification of Japan's financial regulator, the Financial Services Agency (FSA).
mbpp-ja
The mbpp-ja dataset is a programming dataset: the Japanese version of the Mostly Basic Python Problems (MBPP) dataset, translated from English into Japanese by LLM-jp using the translation tool DeepL.
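To give a flavor of the format, an MBPP-style entry pairs a short natural-language specification (in mbpp-ja, written in Japanese) with assert-based test cases, so that generated code can be checked automatically by execution. The problem below is invented for illustration and is not taken from the dataset.

```python
# An MBPP-style problem (invented for illustration, not from mbpp-ja):
# 「リスト内の偶数の合計を返す関数を書いてください。」
# ("Write a function that returns the sum of the even numbers in a list.")
def sum_even(numbers):
    return sum(n for n in numbers if n % 2 == 0)

# MBPP entries ship with assert-based test cases like these;
# a model's generated solution passes only if all asserts hold.
assert sum_even([1, 2, 3, 4]) == 6
assert sum_even([7, 9]) == 0
```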
mawps
The Japanese MAWPS dataset, based on MAWPS (A Math Word Problem Repository), is a mathematical reasoning dataset. This version evaluates the ability to solve new problems using step-by-step reasoning, also known as chain-of-thought (CoT) reasoning. Names of people, units, and places were adjusted during translation to fit the Japanese context. The level of mathematical reasoning is fairly simple: addition, subtraction, multi-step arithmetic, and single or paired equations.
JMMLU
JMMLU is a knowledge dataset of multiple-choice questions, translated into Japanese from a subset of the MMLU dataset, which assesses knowledge at the level of high-school exams and beyond. Questions and answers span 57 subjects, including astronomy, chemistry, sociology, and international law, and some were adapted to Japan's unique cultural background, with subjects such as Japanese civics, Japanese geography, and Japanese idioms.
XL-Sum
XL-Sum is a summarization dataset based on the paper XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages, leveraging Japanese translations of BBC News articles. Each entry has three parts: a title, the main text (the full article), and a summary. Topics include global issues, politics, technology, sports, culture, and more.
Technical setup
The leaderboard is inspired by the Open LLM Leaderboard. Submitted models are automatically deployed using Hugging Face's Inference Endpoints and evaluated with vLLM (v0.6.3), a memory-efficient inference and serving engine, through the llm-jp-eval library (v1.14.1). Computation runs on the backend on mdx, a high-performance computing platform for research in Japan.
Observations
According to Awesome Japanese LLM, a guide to Japanese LLMs (available in Japanese, English, and French), Meta's open-source Llama architecture appears to be a favorite of many Japanese AI labs. However, other architectures are also used successfully in the Japanese open-source community, such as Mistral from France's Mistral AI and Qwen from China's Alibaba. These are the architectures behind the highest-scoring models on the Open Japanese LLM Leaderboard.
For general language processing tasks, Japanese LLMs based on open-source architectures are closing the gap with closed-source LLMs: for example, llm-jp-3-13b-instruct, developed by LLM-jp with university funding, has been observed to achieve performance comparable to closed-source models. Domain-specific datasets such as chABSA (finance), the Wikipedia Annotated Corpus (linguistic annotations), mbpp-ja (code generation), and XL-Sum (summarization) remain a challenge for most LLMs. Interestingly, models from companies and labs based in Japan score better on the JCommonsenseMorality dataset, which evaluates a model's ability to make choices in accordance with Japanese values when faced with ethical dilemmas.
Future directions
The Open Japanese LLM Leaderboard tracks the development of the evaluation tool llm-jp-eval and reflects the constant evolution of Japanese LLMs. Below are a few examples of future directions for llm-jp-eval that we would like to support. Please feel free to contact us to lend a hand or suggest a direction.
New datasets: more Japanese evaluations. The llm-jp-eval evaluation team is actively working on this, currently adding JHumanEval (a Japanese version of HumanEval) and MMLU (Measuring Massive Multitask Language Understanding).
New evaluation system: chain-of-thought evaluation. To better understand how models behave, we would like to compare LLM performance when using chain-of-thought prompts versus basic prompts.
Support for a new metric: out-of-choice rate. For some evaluation tasks, such as natural language inference, there is already a clear list of valid labels. We would like to add a complementary metric measuring how often a model predicts a token outside the given choices. Since the prompt provides the choices explicitly, this metric assesses how well each LLM is able to follow instructions.
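The proposed metric can be sketched in a few lines. This is our own illustration, not llm-jp-eval code: it simply counts the fraction of predictions that fall outside the allowed label set.

```python
# Sketch of the proposed metric (our illustration, not llm-jp-eval code):
# the fraction of model predictions that fall outside the allowed label set.
def out_of_choice_rate(predictions, choices):
    """Fraction of predictions not among the allowed choices."""
    allowed = set(choices)
    misses = sum(1 for p in predictions if p.strip() not in allowed)
    return misses / len(predictions)

labels = ["entailment", "neutral", "contradiction"]
preds = ["entailment", "maybe", "contradiction", "neutral "]
print(out_of_choice_rate(preds, labels))  # 0.25: only "maybe" is out of choice
```

A real implementation would need to decide how much normalization (whitespace, casing, partial matches) to apply before counting a prediction as out of choice; here only surrounding whitespace is stripped.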
Acknowledgements
Built by the research consortium LLM-jp, the Open Japanese LLM Leaderboard is proudly sponsored by the National Institute of Informatics in Tokyo, in collaboration with mdx, a high-performance computing platform.
We would like to express our gratitude to Prof. Yusuke Miyao and Namgi Han from the University of Tokyo for their scientific consultation and guidance, and to Clémentine Fourrier and Toshihiro Hayashi from Hugging Face for their assistance in integrating and customizing the new evaluation framework and leaderboard templates.