TTS Arena: Benchmarking text-to-speech models in the wild

Automatically measuring the quality of text-to-speech (TTS) models is very difficult. Assessing the naturalness and inflection of a voice is trivial for humans but much harder for AI. That is why we are excited to announce the TTS Arena today. Inspired by LMSYS’s Chatbot Arena for LLMs, we developed a tool that lets anyone easily compare TTS models side by side. Submit some text, listen to two different models speak it, and vote for the one you think is best. The results are compiled into a leaderboard that shows the community’s highest-rated models.

Motivation

The field of speech synthesis has long lacked an accurate way to measure the quality of different models. Objective metrics such as WER (word error rate) are unreliable indicators of model quality, while subjective measures such as MOS (mean opinion score) typically come from small-scale experiments with few listeners. As a result, these measurements are generally of little use for comparing two models of roughly similar quality. To address these drawbacks, we invite the community to rank models through an easy-to-use interface. By opening this tool and sharing the results with the public, we aim to democratize how models are ranked and make model comparison and selection accessible to everyone.
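To make the limitation of WER concrete, here is a minimal sketch of how the metric is typically computed: the word-level edit distance between a reference text and a transcript, divided by the number of reference words. In TTS evaluation the transcript usually comes from running a speech recognizer on the synthesized audio; that step is assumed here and not shown, and the function name `word_error_rate` is our own, not part of the TTS Arena.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A flat, robotic voice and an expressive one can both produce a perfectly intelligible transcript and therefore the same WER, which is why the metric says little about naturalness.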

TTS Arena

Ranking AI systems by human preference is not a new approach. Recently, LMSYS applied this method to LLMs in its Chatbot Arena and has collected over 300,000 rankings so far. Given its success, we adopted a similar framework for our leaderboard and invite anyone to rank synthesized audio.

The leaderboard lets a user enter text, which is then synthesized by two models. After listening to each sample, the user votes for the model that sounds more natural. To reduce the risk of human bias and abuse, the model names are revealed only after the vote is submitted.
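The blind-comparison flow described above can be sketched roughly as follows. This is an illustration under our own assumptions, not the Arena's actual code: the `Battle` class, the function names, and the placeholder model names are all hypothetical, and the audio synthesis step is left out entirely.

```python
import random
from dataclasses import dataclass

# Placeholder names; the real Arena uses its own model list.
MODELS = ["model_1", "model_2", "model_3", "model_4"]


@dataclass
class Battle:
    text: str
    model_a: str  # hidden from the voter until the vote is in
    model_b: str


def new_battle(text: str, models: list[str], rng: random.Random) -> Battle:
    """Pick two distinct models at random for the submitted text."""
    model_a, model_b = rng.sample(models, 2)
    return Battle(text, model_a, model_b)


def public_view(battle: Battle) -> dict:
    """What the voter sees before voting: anonymous labels only."""
    return {"text": battle.text, "samples": ["A", "B"]}


def submit_vote(battle: Battle, choice: str) -> dict:
    """Record the vote, then reveal which model was behind each label."""
    if choice not in ("A", "B"):
        raise ValueError("choice must be 'A' or 'B'")
    if choice == "A":
        winner, loser = battle.model_a, battle.model_b
    else:
        winner, loser = battle.model_b, battle.model_a
    return {"winner": winner, "loser": loser}


battle = new_battle("Hello, world.", MODELS, random.Random())
print(public_view(battle))       # no model names exposed
print(submit_vote(battle, "A"))  # names revealed after the vote
```

Keeping the model names out of everything returned before the vote is what prevents a voter from favoring a known brand.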

Selected models

We selected several SOTA (state-of-the-art) models for our leaderboard. Most are open-source models, but we also included several proprietary models so that developers can compare the state of open-source development with proprietary models.

The models available at launch are:

ElevenLabs (proprietary), MetaVoice, OpenVoice, Pheme, WhisperSpeech, XTTS

There are many other open- and closed-source models, but we chose these because they are generally accepted as the highest-quality publicly available models.

TTS Leaderboard

Arena vote results will be published on a dedicated leaderboard. Note that it will be empty at first, until enough votes have accumulated; models will then gradually appear. The leaderboard updates automatically as raters submit new votes.

As in Chatbot Arena, models are ranked using an algorithm similar to the Elo rating system commonly used in chess and other games.
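For readers unfamiliar with Elo, here is a minimal sketch of the textbook update rule applied to a single vote. The post only says the ranking algorithm is "similar to" Elo, so the K-factor of 32, the starting rating of 1000, and the function names below are illustrative assumptions rather than the leaderboard's actual settings.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    # The winner gains exactly what the loser gives up.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Two hypothetical models start at the same (assumed) rating.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
ratings["model_x"], ratings["model_y"] = update_elo(
    ratings["model_x"], ratings["model_y"], a_won=True
)
print(ratings)  # {'model_x': 1016.0, 'model_y': 984.0}
```

Because the size of the update depends on how surprising the result was, beating a much higher-rated model moves the ratings more than beating an equal or weaker one.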

Conclusion

We hope the TTS Arena proves to be a useful resource for all developers. We’d love to hear your feedback! Please let us know if you have any questions or suggestions by sending us an X/Twitter DM or by opening a discussion in the Community tab of the Space.

Credits

Thank you to everyone who helped make this possible, including Clémentine Fourrier, Lucian Pouget, Yoach Lacombe, Main Horse, the Hugging Face team, and others. In particular, we would like to thank VB for his time and technical assistance. We would also like to thank Sanchit Gandhi and Apolinário Passos for their feedback and support during the development process.

