As large language models (LLMs) become increasingly integrated into our lives, it becomes important to assess whether they reflect the nuances and capabilities of a particular language community. For example, Filipinos are among the most active ChatGPT users in the world, ranking fourth in ChatGPT traffic after the US, India, and Brazil (1)(2). Yet despite this heavy usage, there is no clear understanding of how well LLMs work in languages such as Tagalog and Cebuano. Most of the existing evidence is anecdotal, such as screenshots of ChatGPT responding in Filipino being taken as proof of fluency. What is needed instead is a systematic evaluation of LLM capabilities in Philippine languages.
To that end, we developed FilBench, a comprehensive evaluation suite that assesses LLM capabilities in Tagalog, Filipino (a standardized form of Tagalog), and Cebuano. We evaluated over 20 state-of-the-art LLMs on FilBench to provide a comprehensive assessment of their performance on Philippine languages.
FilBench
The FilBench evaluation suite covers four main categories: Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. For example, the Classical NLP category includes tasks such as sentiment analysis, while the Generation category covers various aspects of translation. To ensure that these categories reflect the priorities and trends of NLP research and usage in the Philippines, we curated them based on a survey of research from 2006 to early 2024. Most of these categories include only non-translated content to ensure fidelity to natural Philippine-language use.
- Cultural Knowledge: This category tests a language model's ability to recall factual and culturally specific information. We curated examples that test an LLM's regional and factual knowledge (Global-MMLU), Filipino-centric values (KALAHI), and word-sense knowledge (StingrayBench).
- Classical NLP: This category covers a variety of information-extraction and linguistic tasks, including named entity recognition, sentiment analysis, and text classification, which were traditionally handled by specially trained models. It includes instances from CebuaNER, TLUnified-NER, and Universal NER for named entity recognition, and subsets of SIB-200 and BalitaNLP for text classification and sentiment analysis.
- Reading Comprehension: This category evaluates a language model's ability to understand and interpret Filipino texts, focusing on tasks such as readability, comprehension, and natural language inference. It includes instances from the Cebuano Readability Corpus, Belebele, and NewsPH-NLI.
- Generation: We dedicate most of FilBench to testing an LLM's ability to translate text faithfully, from English to Filipino and from Cebuano to English. It includes a diverse set of test examples from documents (NTREX-128), realistic texts written by volunteers (Tatoeba), and domain-specific texts (TICO-19).
Each of these categories yields an aggregate metric. To obtain a single representative score, which we call the FilBench score, we compute a weighted average of the per-category scores, weighted by the number of examples in each category.
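To make the aggregation concrete, here is a minimal sketch of such an example-weighted average; the category names match FilBench's, but the per-category scores and example counts are hypothetical placeholders, not actual FilBench numbers.

```python
# Minimal sketch of the FilBench score as an example-weighted average.
# Category names match FilBench's; the scores and example counts below
# are hypothetical placeholders, not actual FilBench statistics.

category_scores = {
    "Cultural Knowledge": 0.62,
    "Classical NLP": 0.71,
    "Reading Comprehension": 0.58,
    "Generation": 0.44,
}
category_sizes = {  # number of test examples per category (made up)
    "Cultural Knowledge": 1200,
    "Classical NLP": 2500,
    "Reading Comprehension": 900,
    "Generation": 1800,
}

total_examples = sum(category_sizes.values())
filbench_score = sum(
    category_scores[c] * category_sizes[c] / total_examples for c in category_scores
)
print(f"FilBench score: {filbench_score:.3f}")
```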
To simplify usage and setup, we built FilBench on top of Lighteval, an all-in-one framework for LLM evaluation. For language-specific evaluation, we first defined English-to-Tagalog (and English-to-Cebuano) translation pairs for common terms used in evaluation prompts, such as "yes" (oo), "no" (hindi), and "true" (totoo). We then implemented custom tasks for the capabilities we care about using the templates Lighteval provides.
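For illustration, here is a hedged sketch of what a custom task built this way might look like. It follows Lighteval's community-task pattern (a prompt function returning a Doc plus a LightevalTaskConfig), but the exact class names and fields can differ between Lighteval versions, and the dataset path, label mapping, and metric below are placeholders rather than the actual FilBench configuration.

```python
# Hedged sketch of a Lighteval community task in the style described above.
# Class and field names follow Lighteval's community-task pattern but may
# differ between versions; the dataset repo, label mapping, and metric are
# illustrative placeholders, not the actual FilBench configuration.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc

# English-to-Tagalog translation pairs for common evaluation terms.
TAGALOG_LITERALS = {"yes": "oo", "no": "hindi", "true": "totoo"}


def tagalog_nli_prompt(line, task_name: str = None) -> Doc:
    """Turn one dataset row into a yes/no entailment query in Tagalog."""
    query = (
        f"{line['premise']}\n"
        f"Tanong: totoo ba na {line['hypothesis']}? "
        f"({TAGALOG_LITERALS['yes']}/{TAGALOG_LITERALS['no']})"
    )
    return Doc(
        task_name=task_name,
        query=query,
        choices=[TAGALOG_LITERALS["yes"], TAGALOG_LITERALS["no"]],
        gold_index=int(line["label"]),  # assumes 0 = entailment, 1 = not
    )


# Hypothetical task config; the real FilBench tasks live in Lighteval's
# community_tasks directory with their own datasets and metrics.
tagalog_nli_task = LightevalTaskConfig(
    name="filbench_example:newsph_nli",
    prompt_function=tagalog_nli_prompt,
    suite=["community"],
    hf_repo="some-org/newsph-nli",  # placeholder dataset path
    hf_subset="default",
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],  # may be `metrics=` in newer versions
)

TASKS_TABLE = [tagalog_nli_task]
```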
FilBench is now available as a set of community tasks in the official Lighteval repository!
What did we learn from FilBench?
By evaluating several LLMs on FilBench, we uncovered a number of insights into how they perform on Philippine languages.
Finding #1: Region-specific LLMs still lag behind GPT-4o, but collecting data to train these models remains a promising direction
Over the past few years, there has been a rise in region-specific LLMs targeting Southeast Asian (SEA) languages, such as SEA-LION and SeaLLM. These are open-weight LLMs that can be downloaded freely from Hugging Face. We find that SEA-specific LLMs are often the most parameter-efficient for Philippine languages, achieving the highest FilBench scores compared to other models of their size. However, the best SEA-specific models are still outperformed by closed-source LLMs such as GPT-4o.
Building region-specific LLMs still makes sense: we observe a 2-3% performance improvement when a base LLM is continually fine-tuned on SEA-specific instruction-tuning data. This suggests that efforts to curate Filipino- or SEA-specific training data remain highly relevant, as they can lead to improved performance on FilBench.
Finding #2: Philippine-language translation remains a difficult task for LLMs
We also observe that, of the four FilBench categories, most models struggle the most with Generation. When inspecting the failure modes in generation, we find that models fail to follow translation instructions, produce excessively verbose text, or hallucinate a different language in place of Tagalog or Cebuano.
Finding #3: Open LLMs remain a cost-effective option for Filipino-language tasks
The Philippines tends to have weaker internet infrastructure and lower average incomes (3), which calls for accessible LLMs that are both cost- and compute-efficient. Through FilBench, we were able to identify the LLMs on the Pareto frontier of cost and performance (see the sketch below).
In general, we find that open-weight LLMs, models that can be freely downloaded from Hugging Face, are much cheaper to run than commercial models without sacrificing performance. If you are looking for a replacement for GPT-4o for your Filipino-language tasks, try Llama 4 Maverick!
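To make this concrete, the sketch below shows one way to select models on such a cost/performance Pareto frontier; the model names, prices, and scores are made-up placeholders, not FilBench results.

```python
# Minimal sketch: select models on the cost/performance Pareto frontier.
# The model names, prices, and scores below are hypothetical placeholders,
# not actual FilBench results.

models = [
    # (name, cost in USD per 1M tokens, benchmark score)
    ("model-a", 10.00, 0.74),
    ("model-b", 0.80, 0.70),
    ("model-c", 0.30, 0.55),
    ("model-d", 2.50, 0.68),
]


def pareto_frontier(points):
    """Keep models for which no other model is both cheaper and better."""
    frontier = []
    for name, cost, score in points:
        dominated = any(
            other_cost <= cost
            and other_score >= score
            and (other_cost, other_score) != (cost, score)
            for _, other_cost, other_score in points
        )
        if not dominated:
            frontier.append((name, cost, score))
    return frontier


print(pareto_frontier(models))
# -> model-a (best score), model-b and model-c (cheaper trade-offs);
#    model-d is dominated by model-b, which is cheaper and scores higher.
```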

We also make this information available on the FilBench Leaderboard Hugging Face Space.
Does your LLM work on Philippine languages? Try it on FilBench!
We hope that FilBench provides deeper insight into LLM capabilities for Philippine languages and serves as a catalyst for advancing NLP research and development in the Philippines. The FilBench evaluation suite is built on Hugging Face's Lighteval, allowing LLM developers to easily evaluate their models on the benchmark. For more information, please see the links below.
Acknowledgments
The authors thank Cohere Labs for providing credits through the Cohere Research Grant to run the Aya model series, and Together AI for additional computational credits to run several open models. We also thank the Hugging Face team, especially Clémentine Fourrier and Nathan Habib, and Daniel van Strien for their support in publishing this blog post.
Citation
If you are evaluating on FilBench, please cite our work:
@article{filbench,
  title={FilBench: Can LLMs Understand and Generate Filipino?},
  author={Miranda, Lester James V. and Aco, Elyanah and Manuel, Conner and Imperial, Joseph Marvin},
  journal={arXiv preprint arXiv:2508.03523},
  year={2025}
}