Automatically measuring the quality of text-to-speech (TTS) models is very difficult. Assessing the naturalness and inflection of a voice is trivial for humans but much harder for AI. That is why we are excited to announce the TTS Arena today. Inspired by LMSYS’s Chatbot Arena for LLMs, we have developed a tool that lets anyone easily compare TTS models side by side. Submit some text, listen to two different models speak it, and vote for the one you think sounds best. The results are compiled into a leaderboard that displays the community’s highest-rated models.
Motivation
The field of speech synthesis has long lacked an accurate way to measure the quality of different models. Objective metrics such as WER (word error rate) are unreliable indicators of model quality, while subjective measures such as MOS (mean opinion score) typically come from small-scale experiments with few listeners. As a result, these measurements are generally of little help when comparing two models of similar quality. To address these drawbacks, we invite the community to rank models in an easy-to-use interface. By opening this tool and its results to the public, we aim to democratize how models are ranked and to make comparing and selecting models accessible to everyone.
TTS Arena
Human ranking of AI systems is not a new approach. Recently, LMSYS applied this method in its Chatbot Arena and has collected over 300,000 rankings so far. Given its success, we adopted a similar framework for our leaderboard and invite anyone to rank synthesized audio.
The arena lets a user enter text, which is then synthesized by two models. After listening to each sample, the user votes for the model that sounds more natural. To limit human bias and the risk of abuse, the model names are revealed only after the vote is submitted.
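The blind comparison flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the arena's actual implementation: the `Battle` class, the `start_battle` helper, and the placeholder model names are all hypothetical, and the audio synthesis step is left out entirely.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Battle:
    """One blind comparison between two models (hypothetical structure)."""
    text: str
    model_a: str  # kept hidden from the voter until a vote is recorded
    model_b: str
    winner: Optional[str] = None

    def vote(self, choice: str) -> None:
        """Record the voter's preference; 'a' or 'b', exactly once."""
        if choice not in ("a", "b"):
            raise ValueError("choice must be 'a' or 'b'")
        if self.winner is not None:
            raise RuntimeError("vote already submitted")
        self.winner = self.model_a if choice == "a" else self.model_b

    def reveal(self) -> dict:
        """Return model names only once a vote has been submitted."""
        if self.winner is None:
            return {"a": "hidden", "b": "hidden"}
        return {"a": self.model_a, "b": self.model_b, "winner": self.winner}


def start_battle(text: str, models: Sequence[str],
                 rng: Optional[random.Random] = None) -> Battle:
    """Pick two distinct models at random for the submitted text."""
    rng = rng or random.Random()
    model_a, model_b = rng.sample(list(models), 2)
    return Battle(text=text, model_a=model_a, model_b=model_b)


# Usage with placeholder model names (assumptions, not the real lineup):
battle = start_battle("Hello, world!", ["model-1", "model-2", "model-3"])
print(battle.reveal())  # names are hidden before the vote
battle.vote("a")
print(battle.reveal())  # names and winner are visible after the vote
```

Keeping the names inside the battle record and exposing them only through `reveal` is one simple way to make sure a voter cannot be influenced by brand recognition before voting.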
Selected models
We have selected several SOTA (state-of-the-art) models for the leaderboard. Most are open-source models, but we also include several proprietary models so that developers can compare the state of open-source development with proprietary models.
The models available at launch are:
ElevenLabs (proprietary), MetaVoice, OpenVoice, Pheme, WhisperSpeech, and XTTS
There are many other open- and closed-source models, but we chose these because they are generally accepted as the highest-quality publicly available models.
TTS Leaderboard
Arena voting results will be published on a dedicated leaderboard. Note that it will be empty at first, until enough votes have accumulated. Models will then gradually appear. The leaderboard updates automatically as users submit new votes.
As in the Chatbot Arena, models are ranked using an algorithm similar to the Elo rating system commonly used in chess and other games.
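To give a feel for how pairwise votes can turn into a ranking, here is a minimal sketch of the classic Elo update. The post only says the arena uses an algorithm similar to Elo, so the details here are assumptions for illustration: the starting rating of 1000, the K-factor of 32, and the function names are all hypothetical and may differ from what the leaderboard actually uses.

```python
from typing import Dict, Iterable, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Return the new (rating_a, rating_b) after one vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank_models(votes: Iterable[Tuple[str, str]], initial: float = 1000.0,
                k: float = 32.0) -> List[Tuple[str, float]]:
    """Process (winner, loser) votes in order and sort models by rating."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = update_elo(r_w, r_l, True, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Usage with placeholder model names (not real results):
votes = [("model-1", "model-2"), ("model-1", "model-3"), ("model-3", "model-2")]
for name, rating in rank_models(votes):
    print(f"{name}: {rating:.1f}")
```

A win against a higher-rated model moves the ratings more than a win against a lower-rated one, which is why this family of methods can separate models of similar quality once enough votes accumulate.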
Conclusion
We hope the TTS Arena proves to be a useful resource for all developers. We’d love to hear your feedback! Please let us know if you have any questions or suggestions by sending us an X/Twitter DM or by opening a discussion in the Community tab of the Space.
Credits
Thank you to everyone who helped make this possible, including Clémentine Fourrier, Lucain Pouget, Yoach Lacombe, Main Horse, and the Hugging Face team. In particular, we would like to thank VB for his time and technical assistance. We would also like to thank Sanchit Gandhi and Apolinário Passos for their feedback and support during the development process.