Latency is an important parameter for voice AI. Although developers have made significant advances in model quality, the user experience is still often limited by response time. Hugging Face and Cerebras change that experience. Today we demonstrate what’s possible when you combine an open, modular voice AI architecture with industry-leading inference speed.
The result is a speech-to-speech experience that feels dramatically more natural. Conversations flow with the responsiveness users expect from human interaction, instead of leaving them waiting for an AI to respond.
Architecture: Open cascading Speech-to-Speech stack
This demo is built as a real-time speech-to-speech pipeline. Each part of the system is modular, open, and interchangeable, so developers can easily adapt the stack to different assistants, robots, products, or research projects.
The result is a fully open speech-to-speech loop:
Speech input -> Speech recognition with NVIDIA’s Parakeet -> Gemma 4 VLM inference with Cerebras -> Text-to-speech with Alibaba’s Qwen3TTS -> Speech response
This architecture brings together the strengths of the open-source AI ecosystem. NVIDIA’s Parakeet handles speech recognition, Google DeepMind’s Gemma 4 31B serves as the language model, Cerebras provides fast inference for that model, and Qwen handles text-to-speech. Developers can inspect, modify, and extend every layer.
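To make the idea of interchangeable stages concrete, here is a minimal Python sketch of a cascaded loop. It is not the project's actual code: the `CascadePipeline` class, the stage signatures, and the `fake_*` stand-in functions are all hypothetical, chosen only so the example runs without any models. In a real deployment each stand-in would be replaced by a call to a speech recognition model, a language model endpoint, and a text-to-speech model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures (assumptions for illustration only).
Transcribe = Callable[[bytes], str]   # speech recognition: audio -> text
Generate = Callable[[str], str]       # language model: user text -> reply text
Synthesize = Callable[[str], bytes]   # text-to-speech: reply text -> audio


@dataclass
class CascadePipeline:
    """Chains three independent stages; any one can be swapped out."""
    transcribe: Transcribe
    generate: Generate
    synthesize: Synthesize

    def respond(self, audio_in: bytes) -> bytes:
        text = self.transcribe(audio_in)
        reply = self.generate(text)
        return self.synthesize(reply)


# Stand-in stages: they treat the "audio" bytes as UTF-8 text.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = CascadePipeline(fake_stt, fake_llm, fake_tts)
reply_audio = pipeline.respond(b"hello")

# Swapping one stage leaves the other two untouched.
shouting = CascadePipeline(fake_stt, lambda prompt: prompt.upper(), fake_tts)
shouted_audio = shouting.respond(b"hi")
```

Because each stage is just a callable with a fixed input and output type, replacing the language model or the voice does not require changes anywhere else in the loop.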
Partnership between Cerebras and Hugging Face
In some production systems today, median latency is reasonable, but P95 latency can still stretch to a frustrating few seconds. These delays are even more noticeable when tool calls or multimodal steps require multiple turns.
Cerebras helps solve one of the most important bottlenecks in your stack: language model response time. Cerebras allows the rest of your Hugging Face pipeline to shine by making inference dramatically faster and more stable.
That stability matters most in the long tail. Many systems achieve acceptable median response times, but occasional slow responses can still make conversations feel unreliable.
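The gap between median and tail latency is easy to see with a small calculation. The Python sketch below uses invented numbers, not measurements from this demo: 18 turns that take 400 ms and 2 turns that take 3 seconds. The `percentile` helper is a hypothetical nearest-rank implementation written for this example.

```python
import statistics


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Ceiling division in integers avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Invented per-turn latencies in milliseconds (illustrative only).
latencies_ms = [400] * 18 + [3000] * 2

median_ms = statistics.median(latencies_ms)  # 400.0
p95_ms = percentile(latencies_ms, 95)        # 3000

print(f"median: {median_ms:.0f} ms, P95: {p95_ms} ms")
```

In this made-up sample the median looks healthy, yet one turn in ten takes several seconds, which is the kind of delay users notice in a conversation.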
Built for real-world interaction
This same Hugging Face speech-to-speech pipeline is already powering Reachy Mini robots, with more than 9,000 robots in operation. For robots, voice assistants, and physical AI, responsiveness is more than a cosmetic improvement. It is what makes the interaction feel alive.
The motivation for using Cerebras is therefore not just cost savings. It is low latency, predictable performance, and the ability to create real-time experiences that feel natural at scale.
This collaboration reflects a shared belief that the future of AI will be open and performant. Open source models, open infrastructure, and breakthrough inference speeds combine to create the foundation for the next generation of conversational AI.
We invite developers to explore the demo, experiment with the code, and help shape what’s next in real-time voice AI.
Demo: Hugging Face Space
Repository: Hugging Face/Speech-to-Speech

