Today we’re introducing new, blazingly fast OpenAI Whisper deployments on Inference Endpoints, offering up to 8x performance improvements over previous versions. They leverage incredible work from the AI community to deploy dedicated, powerful transcription models in a cost-effective way, just one click away for everyone.
With this release, we want to make Inference Endpoints more community-centric and allow anyone to contribute and create incredible inference deployments on the Hugging Face platform. Together with the community, we aim to propose optimized deployments for a wide range of tasks using great open-source technologies.
Hugging Face’s unique position at the heart of the open-source AI community makes it the best platform to work with individuals, institutions, and industrial partners to deploy AI models for inference across a wide range of hardware and software.
Inference Stack
The new Whisper endpoints leverage amazing open-source community projects. Inference is powered by the vLLM project, which provides efficient ways to run AI models across a variety of hardware families, in particular, but not limited to, NVIDIA GPUs. We use vLLM’s implementation of OpenAI’s Whisper model, which enables further, lower-level optimizations of the software stack.
This first release targets NVIDIA GPUs with compute capability 8.9 (Ada Lovelace) or greater, like the L4 & L40.
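If you want to check whether a given GPU meets this requirement, a small PyTorch sketch like the following reports its compute capability (Ada Lovelace cards such as the L4 report 8.9):

import torch

# Query the compute capability of the first visible CUDA device.
major, minor = torch.cuda.get_device_capability(0)
print(f"Device: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")

if (major, minor) >= (8, 9):
    print("Ada Lovelace or newer: the optimizations below are supported.")
else:
    print("Older architecture: float8 support may be unavailable.")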
These GPUs enable a range of software optimizations: PyTorch compilation (torch.compile), CUDA graphs, and a float8 KV cache. torch.compile generates optimized kernels in a just-in-time (JIT) fashion, which can modify the computation graph and reorder operations for better performance.
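As a minimal illustration (not the endpoint’s actual code), torch.compile can wrap any callable, and PyTorch will generate fused kernels on the first call:

import torch

def gelu_mlp(x, w1, w2):
    # Two matmuls with a GELU in between, a typical transformer MLP block.
    return torch.nn.functional.gelu(x @ w1) @ w2

# torch.compile traces the function and emits optimized, fused kernels JIT.
compiled_mlp = torch.compile(gelu_mlp)

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
w1 = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
w2 = torch.randn(4096, 1024, device="cuda", dtype=torch.bfloat16)

out = compiled_mlp(x, w1, w2)  # first call triggers compilation, later calls reuse it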
CUDA graphs record the flow of sequential operations, or kernels, running on the GPU and try to group them into larger chunks of work. This grouping reduces the overhead of data movement, synchronization, and GPU scheduling compared to launching many small units of work.
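The sketch below shows the idea with PyTorch’s CUDA graph API, assuming static input shapes (again an illustration, not the endpoint’s internals):

import torch

model = torch.nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)
static_input = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

with torch.no_grad():
    # Warm up on a side stream so capture starts from a clean state.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            static_output = model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Record the kernel sequence once...
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = model(static_input)

# ...then replay it: a single launch replaces many small kernel launches.
static_input.copy_(torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16))
graph.replay()
torch.cuda.synchronize()
print(static_output.shape)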
Finally, we dynamically quantize activations to reduce the memory requirements of the KV cache. Computations are done in half precision, bfloat16 in this case, and the outputs are stored at reduced precision (1 byte for float8 vs. 2 bytes for bfloat16), which lets the KV cache hold more elements and increases the cache hit rate.
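Here is a toy sketch of that idea, assuming a recent PyTorch with float8 dtypes; the real implementation lives inside vLLM’s KV cache and is more involved:

import torch

# Pretend these are freshly computed key/value tensors in bfloat16 (2 bytes/element).
kv = torch.randn(8, 128, 64, device="cuda", dtype=torch.bfloat16)

# Per-tensor dynamic scale so values fit the float8 e4m3 range (max ~448).
scale = kv.abs().amax().clamp(min=1e-12) / 448.0

# Store at 1 byte/element: half the KV cache memory, so it can hold more entries.
kv_fp8 = (kv / scale).to(torch.float8_e4m3fn)

# Dequantize back to bfloat16 when the cached keys/values are read for attention.
kv_restored = kv_fp8.to(torch.bfloat16) * scale

print(kv.element_size(), kv_fp8.element_size())  # 2 vs. 1 bytes per element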
There are many ways to keep pushing this, and we are ready to work with the community to improve it!
Benchmarks
Whisper large-v3 shows roughly an 8x improvement in RTFx, enabling much faster inference with no loss in transcription quality.
We evaluated the transcription quality and runtime efficiency of several Whisper-based models: Whisper large-v3, Whisper large-v3-turbo, and Distil-Whisper large-v3.5, assessing both accuracy and decoding speed under identical conditions against their implementations in the Transformers library.
Word error rate (WER) was computed across eight standard datasets from the Open ASR Leaderboard, including AMI, GigaSpeech, LibriSpeech (Clean and Other), SPGISpeech, TED-LIUM, VoxPopuli, and Earnings22. These datasets span diverse domains and recording conditions, ensuring a robust assessment of generalization and real-world transcription quality. WER measures transcription accuracy as the percentage of words that are incorrectly predicted (via insertion, deletion, or substitution); lower WER means better performance. All three Whisper variants maintain WER comparable to their Transformers baselines.
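For reference, WER between a reference and a hypothesis transcript can be computed with the open-source jiwer package (a common choice, not necessarily the exact tooling used here):

from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / number of reference words.
error_rate = wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")  # two substitutions over nine words ≈ 22.22%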
To assess inference efficiency, we sampled long-form audio from the rev16 dataset, with segments over 45 minutes in length. The real-time factor (RTFx), defined as the ratio of audio duration to transcription time, was measured and averaged across samples. All models were evaluated in bfloat16 on a single L4 GPU using consistent decoding settings (language, beam size, and batch size).
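RTFx itself is straightforward to compute; a small sketch with a placeholder transcription call (the client and its method are hypothetical):

import time

def rtfx(audio_duration_s: float, transcribe) -> float:
    # RTFx = audio duration / wall-clock transcription time.
    # An RTFx of 100 means one hour of audio is transcribed in 36 seconds.
    start = time.perf_counter()
    transcribe()
    elapsed = time.perf_counter() - start
    return audio_duration_s / elapsed

# Example usage (replace the lambda with a real request to your endpoint):
# score = rtfx(45 * 60, lambda: client.transcribe("long_sample.wav"))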
How to Deploy
With Hugging Face Inference Endpoints, you can deploy your own ASR inference pipeline. Endpoints let anyone deploy AI models into production-ready environments with just a few parameters to fill in. They also offer the most complete fleet of AI hardware available on the market, to match your cost and performance needs. All of this lives right where the AI community is building. Getting started couldn’t be easier: simply select the model you want to deploy.
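Deployments can also be created programmatically with the huggingface_hub client. The sketch below shows the general shape; the repository, vendor, region, instance type, and size are assumptions you should adapt to the options shown in the Endpoints UI:

from huggingface_hub import create_inference_endpoint

# All values below are illustrative; pick them from the choices offered
# in the Inference Endpoints UI for your account.
endpoint = create_inference_endpoint(
    "whisper-fast-demo",
    repository="openai/whisper-large-v3",
    framework="pytorch",
    task="automatic-speech-recognition",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-l4",
)
endpoint.wait()   # block until the endpoint is up and running
print(endpoint.url)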
Inference
Running inference against a deployed endpoint takes just a few lines of Python code. You can use the same structure in JavaScript or any other language you are comfortable with.
Here is a small snippet to quickly test the deployed checkpoint:
import requests

endpoint_url = "https://<your-endpoint>.cloud/api/v1/audio/transcriptions"  # replace with your endpoint URL
hf_token = "hf_xxxxx"  # your Hugging Face access token
audio_file = "sample.wav"

headers = {"Authorization": f"Bearer {hf_token}"}

with open(audio_file, "rb") as f:
    files = {"file": f.read()}

response = requests.post(endpoint_url, headers=headers, files=files)
response.raise_for_status()

# The transcription is returned in the "text" field of the JSON response.
print("Transcript:", response.json()["text"])
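Because the route mirrors the OpenAI audio transcription API, you can likely also call it with the openai Python client. This is a sketch under that assumption; the base URL layout and model name are placeholders to adapt to your endpoint:

from openai import OpenAI

# Point the client at the endpoint's OpenAI-compatible base URL (assumed layout).
client = OpenAI(
    base_url="https://<your-endpoint>.cloud/api/v1/",
    api_key="hf_xxxxx",  # your Hugging Face access token
)

with open("sample.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",  # hypothetical model name; check your deployment
        file=f,
    )

print(transcript.text)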
FastRTC Demo
With this blazingly fast endpoint, you can build real-time transcription apps. Try this example built with FastRTC: talk into your microphone and see your speech transcribed in real time!
The Space can easily be duplicated, so feel free to do so. Everything above will be available to the community in a dedicated Hugging Face Endpoints organization on the Hugging Face Hub. Open issues, suggest use cases, and contribute here: hfendpoints-images (Inference Endpoints images) 🚀