Hugging Face’s Open LLM Leaderboard (originally created by Ed Beeching and Lewis Tunstall, and maintained by Nathan Habib and Clémentine Fourrier) is well known for tracking the performance of open source LLMs, comparing them on a variety of tasks such as TruthfulQA and HellaSwag.
This is extremely valuable to the open source community as it provides a way for practitioners to track the best open source models.
In late 2023, Vectara introduced the Hughes Hallucination Evaluation Model (HHEM), an open source model for measuring the extent to which LLMs hallucinate (generate text that is not supported by the source content provided). Covering both open source models such as Llama 2 and Mistral 7B, and commercial models such as OpenAI’s GPT-4, Anthropic’s Claude, and Google’s Gemini, the model highlighted the stark differences that currently exist between models in their propensity to hallucinate.
As I continued to add new models to HHEM, I was looking for an open source solution to manage and update my HHEM leaderboard.
Recently, the Hugging Face leaderboard team released leaderboard templates (here and here). These are lightweight versions of the Open LLM Leaderboard itself that are open source and easier to reuse than the original code.
Today we are pleased to announce the release of the new HHEM leaderboard, built with the HF leaderboard template.
Vectara’s Hughes Hallucination Evaluation Model (HHEM)
The Hughes Hallucination Evaluation Model (HHEM) leaderboard is dedicated to assessing the frequency of hallucinations in document summaries generated by large language models (LLMs) such as GPT-4, Google Gemini, and Meta’s Llama 2.
By releasing this model as open source, Vectara aims to democratize the assessment of LLM hallucinations and raise awareness of the differences that exist between LLMs in their propensity to hallucinate.
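To give a sense of what such an evaluation looks like, here is a minimal sketch of scoring a source passage and a generated summary with the open-source HHEM model on the Hugging Face Hub. It assumes the model can be loaded as a sentence-transformers CrossEncoder, as shown on its model card at the time of release; the source/summary pair is a made-up example, not data from the leaderboard.

```python
# Minimal sketch: scoring factual consistency with the open-source HHEM model.
# Assumes the model loads as a sentence-transformers CrossEncoder (per its model card);
# the source/summary pair below is a made-up example.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

source = "The company reported revenue of $10 million in Q3, up 5% year over year."
summary = "Revenue grew 5% year over year to $10 million in the third quarter."

# predict() returns a factual-consistency score in [0, 1]; higher means the
# summary is better supported by the source (i.e., less hallucinated).
score = model.predict([[source, summary]])[0]
print(f"Factual consistency score: {score:.3f}")
```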
The first release of HHEM consisted of a Hugging Face model alongside a GitHub repository, but I quickly realized that we needed a mechanism for evaluating new models as they appear. The HF leaderboard code template let us quickly stand up a new leaderboard that supports dynamic updates, and we encourage the LLM community to submit new, relevant models for HHEM evaluation.
On a personal note for us at Vectara, HHEM is named after Simon Mark Hughes, who passed away of natural causes in November 2023. We decided to name the model in his honor, in recognition of his lasting legacy in this field.
Setting up HHEM with the LLM leaderboard template
To set up the Vectara HHEM leaderboard, I adapted the HF leaderboard template code to our needs as follows:
After cloning the Space repository into our own organization, I created two associated datasets: “requests” and “results”. These datasets hold, respectively, the requests submitted by users for new LLMs to evaluate and the results of those evaluations. I seeded the results dataset with the existing results from the initial launch and updated the About and Citations sections.
For a simple leaderboard where the results of the evaluation are pushed by the backend to the results dataset, that’s all you need!
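As an illustration, here is a minimal sketch of how a backend might push one evaluation result to the results dataset using the huggingface_hub client. The repository id, file layout, and JSON fields are hypothetical placeholders, not the exact schema used by the HHEM leaderboard.

```python
# Minimal sketch: a backend pushing one evaluation result to a "results" dataset repo.
# The repo id, path layout, and JSON fields below are hypothetical placeholders.
import json
import tempfile
from huggingface_hub import HfApi

api = HfApi()  # authenticates with the HF token from your environment

result = {
    "model": "some-org/some-model",      # model that was evaluated (placeholder)
    "hallucination_rate": 0.052,         # example metric values (placeholders)
    "factual_consistency_rate": 0.948,
    "answer_rate": 0.997,
    "average_summary_length": 86.2,
}

# Write the result to a temporary JSON file before uploading it.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(result, f, indent=2)
    local_path = f.name

# Upload the JSON file into the results dataset; the leaderboard frontend
# reads this dataset to render the table.
api.upload_file(
    path_or_fileobj=local_path,
    path_in_repo="some-org/some-model/results.json",
    repo_id="my-org/results",            # placeholder results dataset id
    repo_type="dataset",
)
```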
Since our evaluation is more involved, I customized the source code to suit the needs of the HHEM leaderboard, as follows:
- leaderboard/src/backend/model_operations.py: contains the two main classes, SummaryGenerator and EvaluationModel. (a) SummaryGenerator generates summaries over HHEM’s private evaluation dataset and computes metrics such as answer rate and average summary length. (b) EvaluationModel loads the Hughes Hallucination Evaluation Model (HHEM) itself to evaluate those summaries, producing metrics such as factual consistency rate and hallucination rate.
- leaderboard/src/backend/evaluate_model.py: defines the Evaluator class, which uses both SummaryGenerator and EvaluationModel to compute the results and return them in JSON format.
- leaderboard/src/backend/run_eval_suite.py: contains the function run_evaluation, which uses the Evaluator to obtain the evaluation results and upload them to the results dataset described above, so that they are displayed on the leaderboard.
- leaderboard/main_backend.py: manages pending evaluation requests and runs automatic evaluations using the classes and functions above. It also includes an option for users to reproduce the evaluation results.
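To make the division of labor between these files concrete, here is a heavily simplified, hypothetical sketch of how the pieces fit together. The class and function names mirror the descriptions above, but the bodies, arguments, and metric calculations are illustrative stand-ins, not the actual leaderboard code.

```python
# Hypothetical, simplified sketch of the backend pipeline described above.
# Class/function names mirror the files listed; bodies are illustrative only.
import json
from sentence_transformers import CrossEncoder


class SummaryGenerator:
    """Generates summaries of the evaluation documents with the model under test."""

    def __init__(self, model_id: str):
        self.model_id = model_id

    def generate(self, documents: list[str]) -> list[str]:
        # In the real backend this would call the LLM being evaluated; here we
        # return placeholder strings so the sketch is runnable end to end.
        return [f"[summary of: {doc[:40]}...]" for doc in documents]


class EvaluationModel:
    """Scores summaries with the Hughes Hallucination Evaluation Model (HHEM)."""

    def __init__(self):
        self.hhem = CrossEncoder("vectara/hallucination_evaluation_model")

    def evaluate(self, documents: list[str], summaries: list[str]) -> dict:
        scores = self.hhem.predict(list(zip(documents, summaries)))
        consistent = sum(float(s) >= 0.5 for s in scores)
        return {
            "factual_consistency_rate": consistent / len(scores),
            "hallucination_rate": 1 - consistent / len(scores),
            "average_summary_length": sum(len(s.split()) for s in summaries) / len(summaries),
        }


def run_evaluation(model_id: str, documents: list[str]) -> str:
    """Ties the two classes together and returns the results as JSON."""
    summaries = SummaryGenerator(model_id).generate(documents)
    results = EvaluationModel().evaluate(documents, summaries)
    return json.dumps({"model": model_id, **results}, indent=2)


if __name__ == "__main__":
    docs = ["The cat sat on the mat while the dog slept by the door."]
    print(run_evaluation("some-org/some-model", docs))
```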
The final source code is available in the Files tab of the HHEM leaderboard repository. Together, these changes make the evaluation pipeline ready to be deployed easily as a Hugging Face Space.
Summary
HHEM is a new classification model that can be used to assess the extent to which LLMs hallucinate. Using the Hugging Face leaderboard template gave us the support we needed for the common requirements of a leaderboard: the ability to manage submissions of new model evaluation requests and to update the leaderboard with new results.
Kudos to the Hugging Face team for making this valuable framework open source and for supporting the Vectara team in its implementation. We expect this code to be reused by other members of the community who aim to publish other types of LLM leaderboards.
If you would like to contribute a new model to HHEM, please submit it to the leaderboard; we welcome suggestions of new models to evaluate.
Also, if you have any questions about the HF leaderboard front-end or about Vectara, feel free to reach out to us on the Vectara or Hugging Face forums.

