Add Benchmaxxer Repellent to Open ASR Leaderboard

By versatileai · May 6, 2026

“When a measure becomes a target, it ceases to be a good measure.” (Goodhart’s Law)

TL;DR: Appen Inc. and DataoceanAI have contributed high-quality English ASR datasets covering scripted and conversational audio across multiple accents. We keep these datasets private so they remain reliable measures of performance, protected from benchmaxing and test-set contamination.

The default average WER is unchanged: it is still calculated over the public datasets only. Optionally, use the toggle to include the private datasets and see the impact 👀

Since its launch in September 2023, the Open ASR Leaderboard has been accessed over 710,000 times. We are blown away by the interest and motivation of our community to continue pushing speech recognition forward 🗣️

Two words sum up the goals (but also the challenges) of maintaining a benchmark like the Open ASR Leaderboard.

Standardization: Models differ in the rules governing their usage and output, such as punctuation and capitalization, and datasets face the same challenge: their transcripts can be structured differently. To address this, all test sets have been collected into one dataset on the Hub for easy access and preview. Additionally, to standardize model output against dataset transcripts, we apply a normalizer (based on Whisper’s) that, among other things, removes punctuation and casing and maps British spellings to American ones.
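To make the effect of this normalization concrete, here is a minimal sketch assuming the openai-whisper package’s EnglishTextNormalizer and jiwer for WER; the example strings are invented, and the leaderboard’s actual evaluation scripts may differ in detail:

```python
# Minimal sketch of normalized WER scoring, assuming the openai-whisper
# package (whisper.normalizers) and jiwer are installed.
from whisper.normalizers import EnglishTextNormalizer
import jiwer

normalizer = EnglishTextNormalizer()

reference = "The COLOUR, he said, is 'red'!"
hypothesis = "the color he said is red"

# Punctuation and casing are stripped, and British spellings are mapped
# to American ones, so both sides normalize to the same string.
ref_norm = normalizer(reference)   # -> "the color he said is red"
hyp_norm = normalizer(hypothesis)  # -> "the color he said is red"

print(jiwer.wer(ref_norm, hyp_norm))  # 0.0
```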

Openness: The UI code and evaluation scripts are open source. This has not only made it easier to incorporate new models, but has also improved the quality of the evaluation procedure through community feedback and contributions.

Standardization and openness are essential for meaningful benchmarking, but they also make benchmarks susceptible to benchmark-specific optimizations (“benchmaxing”) that let models improve leaderboard performance without a corresponding increase in real-world robustness. As models and use cases evolve, the Open ASR Leaderboard continues to incorporate high-quality datasets and new evaluation settings to better reflect real-world performance and to improve robustness against benchmark-specific optimizations.

As we explained in our report, there is no single “catch-all” ASR model. Some models perform better with American English, others perform better with diverse accents and multilingual settings, while others are optimized for speed and conversational voices. Different applications have different priorities. Therefore, a model that performs poorly in one dimension does not necessarily mean it is a bad model overall. The goal of the Open ASR Leaderboard is to capture these nuances and provide a more holistic view of ASR performance.

New high-quality private datasets

To this end, we collaborated with Appen Inc. and DataoceanAI to curate high-quality datasets for ASR benchmarks. Below is information about the various splits.

| Dataset | Accent | Duration (h) | Male / Female (%) | Style | Transcription |
|---|---|---|---|---|---|
| Appen Scripted AU | Australian | 1.42 | 49 / 51 | Read | Punctuation, casing |
| Appen Scripted CA | Canadian | 1.53 | 52 / 48 | Read | Punctuation, casing |
| Appen Scripted IN | Indian | 1.02 | 49 / 51 | Read | Punctuation, casing |
| Appen Scripted US | American | 1.45 | 49 / 51 | Read | Punctuation, casing |
| Appen Conversational IN | Indian | 1.37 | 51 / 49 | Conversational, spontaneous | Punctuation, casing, disfluencies |
| Appen Conversational US003 | American | 1.64 | 49 / 51 | Conversational, spontaneous | Punctuation, casing, disfluencies |
| Appen Conversational US004 | American | 1.65 | 49 / 51 | Conversational, spontaneous | Punctuation, disfluencies |
| DataoceanAI Scripted US | American | 2.43 | 54 / 46 | Read | Punctuation, casing (proper nouns), contractions |
| DataoceanAI Scripted GB | British | 2.43 | 47 / 53 | Read | Punctuation, contractions |
| DataoceanAI Conversational US | American | 8.82 | NA | Conversational, spontaneous | Disfluencies |
| DataoceanAI Conversational GB | British | 5.96 | NA | Conversational, spontaneous | Punctuation, disfluencies |

Below are audio samples illustrating the variety of content (scripted and conversational speech, acronyms, contractions, proper nouns).

Although private datasets may seem contrary to the spirit of openness, we believe that including them increases the credibility of the Open ASR Leaderboard. Benchmaxing becomes much harder to pull off: model developers can neither train on the test sets directly nor hunt for training data that closely resembles a particular test set in order to inflate their macro-average scores.

These datasets also provide targeted metrics that highlight gaps and biases between controlled, often saturated settings (scripted speech, American accents) and more challenging conditions (conversational speech, non-American accents). Below is a screenshot of the new “Private Data” tab.

[Screenshot: the new “Private Data” tab]

Each column is calculated as follows:

  • “Average WER”: a macro average of each data provider’s average, with providers weighted equally.
  • “Avg Scripted”: a macro average over all scripted datasets.
  • “Avg Conversational”: a macro average over all conversational datasets.
  • “Avg US”: a macro average over all American-accent datasets.
  • “Avg Non-US”: a macro average over all non-American-accent datasets.
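
As a hedged illustration of the provider-level macro average (the split names and WER numbers below are invented, not real leaderboard data), note that each provider is reduced to a single mean before averaging, so a provider with more splits does not dominate:

```python
# Sketch of the "Average WER" macro average described above; the split
# names and per-split WERs are placeholders.
from statistics import mean

wers = {
    "appen": {
        "scripted_us": 3.2,
        "scripted_au": 4.1,
        "conversational_us003": 9.8,
    },
    "dataoceanai": {
        "scripted_us": 3.9,
        "conversational_us": 11.2,
    },
}

# Reduce each provider to one number, then average the providers,
# weighting them equally regardless of how many splits each has.
provider_means = [mean(splits.values()) for splits in wers.values()]
print(f"Average WER: {mean(provider_means):.2f}")
```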

We intentionally do not publish per-split scores, to prevent model developers from tuning their models to a specific data provider or accent.

How can I evaluate my model based on this data?

To evaluate your model on this data, register it on the Open ASR Leaderboard and run the evaluation. As before, adding a model to the leaderboard happens through the Open ASR Leaderboard GitHub repository.

When you open a pull request, you will see a checklist for your model. As before, you need to report your results on the public datasets; we then validate those public results, compute the metrics on the private sets ourselves, and double-check the results obtained.

While you wait for your model to be added to the Open ASR Leaderboard, you can self-report metrics on the public sets by adding a YAML snippet like the following to your model card. The model will then appear in the (unvalidated) leaderboard displayed on the dataset page (see the screenshot below). Learn more about this decentralized approach to evaluation here.
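
A minimal sketch of such a snippet, using the Hub’s standard model-index metadata (the model name, dataset, and WER value here are placeholders to replace with your own results):

```yaml
# Illustrative model-card metadata; swap in your own model name,
# dataset, and measured WER.
model-index:
- name: my-asr-model
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
    metrics:
    - type: wer
      value: 3.2
      name: Test WER
```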

[Screenshot: the self-reported (unvalidated) leaderboard on the dataset page]

Do models trained on data from these providers have an advantage?

Possibly. We have asked Appen and DataoceanAI not to distribute this exact data to their clients. However, even when the exact data is never shared, training data drawn from a similar distribution can still help a model fit the corresponding evaluation set (much as one can benchmax a hard public test set by optimizing against it). To mitigate this, having multiple data providers balances out whatever advantage a model gains from sourcing data from any one of them. We also welcome additional data providers and evaluation sets under the Private Data tab.

Additionally, the default average WER excludes the private sets from the macro average, so they do not affect a model’s default ranking.

In the screenshot below, the “Private Data” split is toggled off, meaning it is not included in the macro average over the datasets.

[Screenshot: leaderboard with the “Private Data” split toggled off]

Just toggle the “Private Data” split on to include the private datasets in the macro average.

[Screenshot: leaderboard with the “Private Data” split toggled on]

The “Rank Δ” column shows how a model’s position changes relative to the default macro-average configuration. Toggling public datasets in or out also changes the macro average, letting users tailor the evaluation to the use cases and data distributions most relevant to their application.
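
A hedged sketch of how such a rank delta can be computed (the model names and average WERs below are invented):

```python
# Sketch of the "Rank Δ" idea: compare a model's rank under the default
# (public-only) average with its rank once private sets are toggled in.
public_avg = {"model_a": 6.1, "model_b": 6.4, "model_c": 7.0}
toggled_avg = {"model_a": 8.9, "model_b": 7.2, "model_c": 7.8}

def ranks(scores):
    # Lower WER is better; rank 1 is the best model.
    ordered = sorted(scores, key=scores.get)
    return {model: pos + 1 for pos, model in enumerate(ordered)}

default_rank, toggled_rank = ranks(public_avg), ranks(toggled_avg)
for model in public_avg:
    delta = default_rank[model] - toggled_rank[model]  # positive = moved up
    print(f"{model}: rank {toggled_rank[model]} (Δ {delta:+d})")
```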

What’s next?

We look forward to hearing feedback from the community on how the new tab and the dataset-toggling features help users identify the best model for their applications. We are also considering evaluations that better reflect noisy, real-world conditions. Expect news on that 😉

When preparing the private evaluation sets, we took special care to ensure that audio and transcript quality was consistent across the datasets, including developing tools to identify difficult cases such as low signal-to-noise conditions and transcript mismatches, since these factors can have a significant impact on WER. More details will be provided in a future post.
