While everyone (and their grandma 👵) is launching new ASR models, choosing the right one for your use case can feel harder than picking your next Netflix show. As of November 21, 2025, the Hugging Face Hub hosts 150 Audio-Text-to-Text models and 27,000 ASR models 🤯
Most benchmarks focus on short-form English transcription, overlooking (1) multilingual performance and (2) model throughput, both of which can be deciding factors for long-form audio such as conferences and podcasts.
Over the past two years, the Open ASR Leaderboard has become the standard for comparing open- and closed-source models on both accuracy and efficiency. Multilingual and long-form transcription tracks were recently added to the leaderboard 📈
TL;DR – Open ASR Leaderboard

📄 New preprint on ASR trends from the leaderboard: https://hf.co/papers/2510.06961
🧠 Best accuracy: Conformer encoder + LLM decoder (open source ftw 🥳)
⚡ Fastest: CTC / TDT decoders
🌍 Multilingual: broad coverage comes at the cost of single-language performance
⌛ Long-form: closed-source systems still lead (for now 😉)
🧑‍💻 Fine-tuning guides (Parakeet, Voxtral, Whisper) to keep improving performance
As of November 21, 2025, the Open ASR Leaderboard compares over 60 open source and closed source models from 18 organizations across 11 datasets.
A recent preprint details the technical setup and highlights important trends in modern ASR. Here are the key takeaways 👇
1. Conformer encoder 🤝 LLM decoder tops the charts 🏆
Currently, models that pair a Conformer encoder with a Large Language Model (LLM) decoder lead in English transcription accuracy. For example, NVIDIA’s Canary-Qwen-2.5B, IBM’s Granite-Speech-3.3-8B, and Microsoft’s Phi-4-Multimodal-Instruct achieve the lowest word error rates (WER), demonstrating that integrating an LLM decoder can significantly improve ASR accuracy.
💡 Pro Tip: NVIDIA has introduced Fast Conformer, a 2x faster variant of Conformer. It is used in the Canary and Parakeet model suites.
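If you want to try a Fast Conformer model locally, here is a minimal sketch using NVIDIA’s NeMo toolkit, following the pattern from the Parakeet model cards. The file name is a placeholder, and the exact output format can vary between NeMo versions:

```python
# Minimal sketch: transcribing with a Fast Conformer model from the Parakeet suite.
# Assumes: pip install -U "nemo_toolkit[asr]" and a local mono 16 kHz WAV file.
from nemo.collections.asr.models import ASRModel

# Fast Conformer encoder + TDT decoder (see section 2 for the speed trade-off)
model = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

outputs = model.transcribe(["meeting.wav"])  # placeholder audio path
print(outputs[0].text)  # recent NeMo versions return Hypothesis objects
```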
2. Speed-accuracy trade-off ⚖️
Although these LLM decoders are more accurate, they tend to be slower than simpler decoders. On the Open ASR Leaderboard, efficiency is measured by the inverse real-time factor (RTFx), the ratio of audio duration to transcription time, where higher is better.
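To make the metric concrete, here is a small sketch of how an RTFx number can be measured. The function and argument names are illustrative, not the leaderboard’s actual evaluation harness:

```python
import time

def measure_rtfx(transcribe_fn, audio, audio_duration_s):
    """Return RTFx = seconds of audio processed per second of compute.

    `transcribe_fn` and `audio` are placeholders for your model and input.
    An RTFx of 100 means one hour of audio is transcribed in ~36 seconds.
    """
    start = time.perf_counter()
    transcribe_fn(audio)
    elapsed = time.perf_counter() - start
    return audio_duration_s / elapsed  # higher is better
```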
For even faster inference, CTC and TDT decoders deliver 10 to 100x higher throughput at only slightly higher error rates, which makes them ideal for real-time, offline, or batch transcription (meetings, lectures, podcasts, etc.).
3. Multilingual 🌍
OpenAI’s Whisper Large v3 remains a powerful multilingual baseline, supporting 99 languages. However, fine-tuned or distilled variants such as Distil-Whisper and CrisperWhisper often outperform the original on English-only tasks, showing how targeted fine-tuning can boost specialized performance (want to fine-tune your own? Check out our guides for Whisper, Parakeet, and Voxtral).
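For instance, running a distilled Whisper variant is a few lines with the transformers pipeline. A minimal sketch, where the audio file name is a placeholder:

```python
# Short-form English transcription with a distilled Whisper variant.
# Assumes: pip install transformers torch (plus ffmpeg for audio decoding)
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v3",
)
print(asr("sample.wav")["text"])  # placeholder audio path
```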
That said, a focus on English tends to reduce multilingual coverage 🌐 This is a classic trade-off between specialization and generalization. Similarly, self-supervised systems such as Meta’s Massively Multilingual Speech (MMS) and Omnilingual ASR can cover over 1,000 languages, but lag behind language-specific encoders in accuracy.
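MMS exposes its huge language coverage through per-language adapter layers. A sketch of switching the model to French, based on the transformers MMS API (the language code is illustrative; MMS uses ISO 639-3 codes):

```python
# Switching MMS to a specific language via its CTC adapter layers.
# Assumes: pip install transformers torch
from transformers import Wav2Vec2ForCTC, AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
model = Wav2Vec2ForCTC.from_pretrained("facebook/mms-1b-all")

# Load the adapter weights and vocabulary for French ("fra")
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")
```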
⚠️ Only five languages are currently benchmarked. We plan to expand coverage, and we welcome contributions of new datasets and models for multilingual ASR via GitHub pull requests.
🎯 Alongside the multilingual benchmarks, several community-driven leaderboards focus on individual languages. For example, the Open Universal Arabic ASR Leaderboard compares models across Modern Standard Arabic and regional dialects, highlighting how phonetic variation and dialect diversity challenge current systems. Similarly, the Russian ASR Leaderboard provides a growing hub for evaluating encoder-decoder and CTC models on Russian-specific phonology and morphology. These localized efforts reflect the broader multilingual leaderboard mission: facilitating dataset sharing, fine-tuned checkpoints, and transparent model comparisons, especially for languages with fewer established ASR resources.
4. Long-form transcription is a different game ⏳
For long-form audio (podcasts, lectures, conferences, etc.), closed-source systems still outperform open ones. This could come down to domain tuning, custom chunking strategies, or production-grade optimization.
Among open models, OpenAI’s Whisper Large v3 performs best. But when it comes to throughput, CTC-based Conformers win 🚀 For example, NVIDIA’s Parakeet CTC 1.1B reaches an RTFx of 2793.75 versus 68.56 for Whisper Large v3, at the cost of only a modest increase in WER (6.68 vs. 6.43).
What’s the trade-off? Parakeet is English-only, which once again reminds us of the tension between multilingual coverage and specialization.
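On the open side, a common recipe for long-form audio is chunked inference: the file is split into overlapping windows, transcribed in batches, and the chunk outputs are merged. A minimal sketch with the transformers pipeline, where the chunk length and batch size are illustrative knobs and the file name is a placeholder:

```python
# Chunked long-form transcription with the transformers pipeline.
# Assumes: pip install transformers torch (plus ffmpeg for audio decoding)
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # window size in seconds
    batch_size=8,       # chunks transcribed in parallel
)
result = asr("podcast.mp3", return_timestamps=True)  # placeholder path
print(result["text"])
```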
Closed systems still lead the way here, but there is great potential for open-source innovation: long-form ASR is one of the most exciting frontiers for the community to tackle.
Given how quickly ASR is evolving, we are excited to see which new architectures push performance and efficiency further, and how the Open ASR Leaderboard continues to serve as a transparent, community-driven benchmark and a reference for other leaderboards (Russian, Arabic, and audio deepfake detection).
Stay tuned as we continue to expand the Open ASR Leaderboard with more models, more languages, and more datasets 🚀
🙌 Want to contribute? Head over to our GitHub repository and open a pull request 🤗

