Over the past few months, many new real-time voice models have been released, with entire companies being founded around both open- and closed-source models. To name a few milestones: OpenAI and Google released live multimodal APIs for ChatGPT and Gemini. OpenAI went so far as to release a 1-800-CHATGPT phone number! Kyutai released Moshi, a fully open-source audio-to-audio LLM. Alibaba released Qwen2-Audio and Fixie.ai released Ultravox, two open-source LLMs that natively understand audio. ElevenLabs raised $180 million in their Series C.
Despite this explosion on the model and funding side, building real-time AI applications that stream audio and video, especially in Python, remains challenging.
ML engineers may not have experience with the technologies needed to build real-time applications, such as WebRTC. Even code-assistant tools like Cursor and Copilot struggle to write Python code that supports real-time audio/video applications. I know from experience!
That’s why we’re excited to announce FastRTC, the real-time communication library for Python. The library is designed to make it easy to build real-time audio and video AI applications entirely in Python!
In this blog post, we’ll walk through the basics of FastRTC by building real-time audio applications. By the end, you’ll understand the core features of FastRTC:
- Automatic voice detection and turn-taking built in, so you only need to worry about the logic for responding to the user.
- 💻 Automatic UI – a built-in WebRTC-enabled Gradio UI for testing (or deploying to production!).
- Call via phone – use fastphone() to get a free phone number to call into your audio stream (HF token required; increased limits for PRO accounts).
- WebRTC and WebSocket support.
- Customizable – you can mount the stream in any FastAPI app to serve a custom UI and deploy beyond Gradio.
- Lots of utilities for text-to-speech, speech-to-text, and stop-word detection to get you started.
Let’s dive in.
Get started
First, we’ll build the “hello world” of real-time audio: echoing back what the user says. In FastRTC, this is as simple as:
from fastrtc import Stream, ReplyOnPause
import numpy as np

def echo(audio: tuple[int, np.ndarray]) -> tuple[int, np.ndarray]:
    yield audio

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()
Let’s break it down:
- ReplyOnPause handles voice detection and turn-taking for you, so you only need to worry about the logic for responding to the user. Any generator that yields tuples of audio (represented as (sample_rate, audio_data)) will work.
- The Stream class builds a Gradio UI for quickly testing your stream. Once prototyping is complete, the stream can be deployed as a production-ready FastAPI app in a single line of code, stream.mount(app), where app is a FastAPI app.
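To make the (sample_rate, audio_data) contract concrete, here is a minimal sketch using only NumPy. The chunked_echo handler is hypothetical (not part of FastRTC): it echoes the input back in half-second chunks instead of all at once, which is all a handler has to do to stream audio.

```python
import numpy as np

# A handler receives audio as a (sample_rate, audio_data) tuple and
# yields tuples of the same shape. This hypothetical handler echoes
# the audio back in roughly half-second chunks.
def chunked_echo(audio: tuple[int, np.ndarray]):
    sample_rate, data = audio
    chunk_size = sample_rate // 2  # ~0.5 s of samples per chunk
    for start in range(0, data.shape[-1], chunk_size):
        yield (sample_rate, data[..., start:start + chunk_size])

# Example: one second of silence at 16 kHz comes back as two chunks.
silence = (16000, np.zeros(16000, dtype=np.int16))
chunks = list(chunked_echo(silence))
```

Any generator with this shape could be passed to ReplyOnPause in place of echo.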
Here it is in action:
Leveling Up: LLM Voice Chat
The next level is to use an LLM to respond to the user. FastRTC comes with built-in speech-to-text and text-to-speech capabilities, so working with LLMs is extremely easy. Let’s change our echo function accordingly:
import os

from fastrtc import ReplyOnPause, Stream, get_stt_model, get_tts_model
from openai import OpenAI

sambanova_client = OpenAI(
    api_key=os.getenv("SAMBANOVA_API_KEY"), base_url="https://api.sambanova.ai/v1"
)
stt_model = get_stt_model()
tts_model = get_tts_model()

def echo(audio):
    prompt = stt_model.stt(audio)
    response = sambanova_client.chat.completions.create(
        model="Meta-Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    prompt = response.choices[0].message.content
    for audio_chunk in tts_model.stream_tts_sync(prompt):
        yield audio_chunk

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()
We use the SambaNova API because it’s fast. get_stt_model() fetches Moonshine Base and get_tts_model() fetches Kokoro from the Hub. However, you can use any LLM/text-to-speech/speech-to-text API, or even a speech-to-speech model. Bring the tools you love; FastRTC just handles the real-time communication layer.
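Since FastRTC only cares about receiving tuples of (sample_rate, np.ndarray), you don’t strictly need external APIs at all. As an illustration, here is a sketch of a hypothetical handler, beep_reply (not part of FastRTC), that answers every utterance with a quarter-second 440 Hz beep generated directly with NumPy:

```python
import numpy as np

# Hypothetical handler that skips STT/TTS entirely and replies with a
# 440 Hz tone; any generator yielding (sample_rate, np.ndarray) works.
def beep_reply(audio):
    sample_rate = 24000
    duration = 0.25  # seconds
    t = np.linspace(0, duration, int(sample_rate * duration), endpoint=False)
    tone = (0.2 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
    yield (sample_rate, tone)

# The reply is a quarter-second of float32 samples.
sr, tone = next(beep_reply(None))
```

Passing beep_reply to ReplyOnPause in place of echo would work unchanged, since the transport layer is agnostic to how the audio was produced.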
Bonus: Call via Phone
If you call stream.fastphone() instead of stream.ui.launch(), you’ll get a free phone number to call into your stream. Note that a Hugging Face token is required; limits are increased for PRO accounts.
You will see something like this in your terminal:
INFO: Your FastPhone is now live! Call +1 877-713-4471 and use code 530574 to connect to your stream.
INFO: You have 30:00 minutes remaining in your quota (resetting on 2025-03-23)
You can then call the number and it will connect you to your stream!
Next Steps
- Read the docs to learn more about the basics of FastRTC.
- The best way to start building is to check out the cookbook. Find out how to integrate with popular LLM providers (including OpenAI and Gemini’s real-time APIs), integrate your stream with a FastAPI app, and do a custom deployment.
- Star the repo and file bug reports and issue requests!
- Follow the FastRTC org on Hugging Face for updates and check out deployed examples!
Thank you for checking out FastRTC!