Blazingly fast Whisper transcriptions with Inference Endpoints

By versatileai · May 14, 2025 · 5 min read

Today we’re introducing new, blazingly fast OpenAI Whisper deployment options for Inference Endpoints. They offer up to 8x performance improvements over previous versions and leverage the incredible work done by the AI community, making it a one-click operation for anyone to deploy dedicated, powerful transcription models in a cost-effective way.

Through this release, we want to make Inference Endpoints more community-centric and allow anyone to come and create incredible inference deployments on the Hugging Face platform. Together with the community, we aim to propose optimized deployments for a wide range of tasks using the best available open-source technologies.

Hugging Face’s unique position at the heart of the open-source AI community makes it the best-placed platform for working with individuals, institutions, and industry partners to deploy AI models for inference across a wide range of hardware and software.

Inference Stack

The new Whisper endpoints leverage outstanding open-source community projects. Inference is powered by the vLLM project, which provides efficient ways to run AI models across a variety of hardware families and, in particular, is not limited to NVIDIA GPUs. vLLM’s implementation of OpenAI’s Whisper model makes it possible to enable further lower-level optimizations of the software stack.

This first release targets NVIDIA GPUs with compute capability 8.9 (Ada Lovelace), such as the L4 and L40, and enables the following optimizations:

  • PyTorch compilation (torch.compile)
  • CUDA graphs
  • float8 KV cache

torch.compile generates optimized kernels just-in-time (JIT), which lets it modify the computation graph and reorder operations.

CUDA graphs record the flow of sequential operations, or kernels, executed on the GPU and attempt to group them into larger units of work. Running one larger unit instead of many small ones reduces the overhead of data movement, synchronization, and GPU scheduling.

Finally, we dynamically quantize activations to reduce the memory footprint of the KV cache. The computation itself is done in half precision (bfloat16), while outputs are stored in reduced precision (1 byte for float8 vs. 2 bytes for bfloat16), allowing the KV cache to hold more elements and increasing the cache hit rate.
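Back-of-the-envelope arithmetic shows why 1-byte float8 storage matters; the layer, head, and sequence dimensions below are illustrative placeholders, not Whisper’s actual configuration:

```python
# KV cache size per sequence: 2 tensors (K and V) per layer,
# each of shape [num_heads, seq_len, head_dim]
num_layers, num_heads, head_dim, seq_len = 32, 20, 64, 1500

def kv_cache_bytes(bytes_per_element):
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_element

bf16_bytes = kv_cache_bytes(2)  # bfloat16: 2 bytes per element
fp8_bytes = kv_cache_bytes(1)   # float8: 1 byte per element
print(bf16_bytes // fp8_bytes)  # float8 fits 2x the elements in the same budget
```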

There are many ways to keep pushing this, and we are ready to work with the community to improve it!

Benchmarks

Whisper large-v3 shows roughly an 8x improvement in RTFx, allowing much faster inference with no loss in transcription quality.

We evaluated the transcription quality and runtime efficiency of several Whisper-based models: Whisper large-v3, Whisper large-v3-turbo, and Distil-Whisper large-v3.5, assessing both accuracy and decoding speed under identical conditions against the implementations in the Transformers library.

Word error rate (WER) was calculated across eight standard datasets from the Open ASR Leaderboard: AMI, GigaSpeech, LibriSpeech (clean and other), SPGISpeech, TED-LIUM, VoxPopuli, and Earnings22. These datasets span diverse domains and recording conditions, ensuring a robust assessment of generalization and real-world transcription quality. WER measures transcription accuracy as the percentage of words predicted incorrectly (through insertions, deletions, or substitutions); a lower WER means better performance. All three Whisper variants maintain performance comparable to the Transformers baseline.
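For illustration, WER can be computed with a word-level Levenshtein distance; this is a from-scratch sketch, not the Open ASR Leaderboard’s evaluation code:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length,
    computed via word-level edit distance with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / len(ref)

# two substitutions ("on"->"in", "mat"->"hat") over six reference words
print(round(wer("the cat sat on the mat", "the cat sat in the hat"), 3))  # 0.333
```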

To assess inference efficiency, we sampled long-form audio from the Rev16 dataset, whose segments exceed 45 minutes in length. The real-time factor (RTFx), defined as the ratio of audio duration to transcription time, was measured and averaged across samples. All models were evaluated in bfloat16 on a single L4 GPU with consistent decoding settings (language, beam size, and batch size).
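RTFx as defined above is straightforward to compute; the numbers below are made up for illustration, not measured results:

```python
def rtfx(audio_seconds: float, transcription_seconds: float) -> float:
    # real-time factor: seconds of audio transcribed per second of compute;
    # higher is faster (RTFx of 1.0 means exactly real time)
    return audio_seconds / transcription_seconds

# e.g. a hypothetical 45-minute recording transcribed in 18 seconds
print(rtfx(45 * 60, 18.0))  # 150.0
```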

[Figure: comparison of real-time factors (RTFx) across models]

How to deploy

With Hugging Face Inference Endpoints, you can deploy your own ASR inference pipeline. Endpoints let anyone deploy AI models in a production-ready environment by filling in just a few parameters, and they offer the most complete fleet of AI hardware available on the market to match your cost and performance needs, all directly from where the AI community builds. Getting started couldn’t be easier: simply select the model you want to deploy.

Inference

Running inference against a deployed model’s endpoint takes only a few lines of Python. You can use the same structure from JavaScript or any other language you are comfortable with.

Here is a small snippet to quickly test the deployed checkpoint (replace the placeholder URL with your endpoint’s URL; the response field name follows the OpenAI-compatible transcription schema):

import requests

endpoint_url = "https://<your-endpoint>.cloud/api/v1/audio/transcriptions"
hf_token = "hf_xxxxx"  # your Hugging Face access token
audio_file = "sample.wav"

headers = {"Authorization": f"Bearer {hf_token}"}

with open(audio_file, "rb") as f:
    files = {"file": f.read()}

response = requests.post(endpoint_url, headers=headers, files=files)
response.raise_for_status()

print("Transcript:", response.json()["text"])

FastRTC demo

This blazingly fast endpoint makes it possible to build real-time transcription apps. Try this example built with FastRTC: speak into your microphone and see your speech transcribed in real time!

The Space can easily be duplicated, so feel free to do so. Everything above is available to the community in a dedicated organization on the Hugging Face Hub. Open issues, suggest use cases, and contribute here: hfendpoints-images (Inference Endpoints images) 🚀

© 2025 Versa AI Hub. All Rights Reserved.