Pioneering the frontiers of audio production

By versatileai · November 21, 2024

Published October 30, 2024

Authors: Zalán Borsos, Matt Sharifi, Marco Tagliasacchi

Our pioneering voice generation technology is helping people around the world interact with digital assistants and AI tools in more natural, conversational and intuitive ways.

Speech is central to human relationships. It helps people around the world exchange information and ideas, express emotions, and create mutual understanding. As the technology built to produce natural, dynamic audio continues to improve, richer and more engaging digital experiences are unlocked.

Over the past few years, we have been pushing the frontiers of audio generation, developing models that can create high-quality, natural-sounding audio from a variety of inputs, including text, tempo controls, and specific voices. This technology powers single-speaker audio in many Google products and experiments, including Gemini Live, Project Astra, Journey Voices, and YouTube’s automated dubbing, and helps people around the world interact with more natural, conversational, and intuitive digital assistants and AI tools.

We recently collaborated with partners across Google to help develop two new features that can generate long-form conversations between multiple speakers, making complex content more accessible:

NotebookLM Audio Overviews turn uploaded documents into engaging, lively conversations. With one click, two AI hosts summarize your material, make connections between topics, and banter back and forth. Illuminate creates formal, AI-generated discussions of research papers, making knowledge more accessible and digestible.

Here we provide an overview of the latest speech generation research that underpins all of these products and experimental tools.

Pioneering technology for audio generation

For many years, we have invested in speech generation research, exploring new ways to make conversations in our products and experimental tools more natural. Our earlier work on SoundStorm first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

That work extended our earlier projects, SoundStream and AudioLM, which applied many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses audio input without sacrificing quality. As part of the training process, SoundStream learns how to map audio to different acoustic tokens. These tokens capture all the information needed to reconstruct the audio with high fidelity, including characteristics such as prosody and timbre.
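
To make that token-mapping idea concrete, here is a minimal sketch of residual vector quantization, the kind of scheme neural codecs in this family use to turn encoder frames into discrete tokens. It is illustrative only: the `ResidualVQ` class, its dimensions, and the codebook sizes are assumptions rather than SoundStream's actual implementation, and the convolutional encoder and decoder that surround it are omitted.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Quantize each encoder frame into a stack of tokens, coarse to fine."""
    def __init__(self, num_quantizers=4, codebook_size=1024, dim=64):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)]
        )

    def encode(self, frames):                        # frames: (T, dim) latents
        tokens, residual = [], frames
        for cb in self.codebooks:
            idx = torch.cdist(residual, cb.weight).argmin(dim=-1)  # nearest code
            tokens.append(idx)
            residual = residual - cb(idx)            # quantize what is left over
        return torch.stack(tokens, dim=-1)           # (T, num_quantizers) tokens

    def decode(self, tokens):                        # tokens: (T, num_quantizers)
        return sum(cb(tokens[:, i]) for i, cb in enumerate(self.codebooks))

codec = ResidualVQ()
latents = torch.randn(100, 64)   # stand-in for conv-encoder output frames
codes = codec.encode(latents)    # discrete tokens capturing prosody, timbre, etc.
recon = codec.decode(codes)      # approximate latents for a conv decoder
```

Each successive codebook quantizes the residual the previous one left behind, which is why the token stack moves from coarse to fine detail.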

AudioLM treats audio generation as a language modeling task, producing the acoustic tokens of codecs such as SoundStream. As a result, the AudioLM framework makes no assumptions about the type or composition of the audio it produces, and it is flexible enough to handle a wide variety of sounds without architectural adjustments, making it well suited to modeling multi-speaker dialogue.
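
The "audio as language modeling" framing can be sketched in a few lines: treat codec token IDs exactly like word IDs and train a causal transformer with next-token prediction. This is a toy illustration of the idea, not AudioLM itself; `AcousticLM` and every hyperparameter here are invented for the example.

```python
import torch
import torch.nn as nn

class AcousticLM(nn.Module):
    """Decoder-style transformer trained with next-token prediction over
    acoustic-token IDs, exactly as a text LM is trained over words."""
    def __init__(self, vocab=1024, dim=256, layers=4, heads=4, max_len=2048):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                          # ids: (B, T) token IDs
        T = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(T, device=ids.device))
        # causal mask: position i may only attend to positions <= i
        causal = torch.triu(torch.full((T, T), float("-inf"), device=ids.device), 1)
        return self.head(self.blocks(x, mask=causal))  # next-token logits

model = AcousticLM()
ids = torch.randint(0, 1024, (2, 128))   # codec tokens for two audio clips
logits = model(ids[:, :-1])              # predict token t+1 from tokens <= t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1024), ids[:, 1:].reshape(-1))
```

Because the model only ever sees token IDs, nothing in this setup depends on what kind of sound the tokens encode, which is the flexibility the paragraph above describes.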

Example of a multi-speaker dialogue generated by NotebookLM Audio Overviews, based on a few potato-related documents.

Building on this research, our state-of-the-art speech generation technology can produce two minutes of dialogue, with improved naturalness, speaker consistency, and acoustic quality, when given a dialogue script and speaker turn markers. The model performs this task in less than 3 seconds in one inference pass on a single Tensor Processing Unit (TPU) v5e chip, meaning it generates audio more than 40 times faster than real time.
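
To put those numbers in plain terms, here is a small illustration. The turn-marker syntax is hypothetical, since the post does not specify the real script format; only the 2-minute and 3-second figures come from the text.

```python
# Hypothetical turn-marked script (illustrative syntax only):
script = ("<speaker:1> Welcome back to the show. "
          "<speaker:2> Thanks, it's great to be here. "
          "<speaker:1> So, let's dive in.")

# The stated speed: 2 minutes of audio in under 3 seconds of inference.
audio_seconds, wall_clock_seconds = 120, 3
print(f"~{audio_seconds / wall_clock_seconds:.0f}x faster than real time")  # ~40x
```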

Scaling the audio generation model

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer segments, we created an even more efficient speech codec that compresses audio into a sequence of tokens at rates as low as 600 bits per second, without compromising output quality.

The tokens generated by our codec have a hierarchical structure and are grouped by time frame. The first token in a group captures phonetic and prosodic information, and the last token encodes acoustic details.
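
A quick back-of-the-envelope on what that hierarchy implies. Only the 600 bit/s figure comes from the post; the 10-bit tokens (i.e., 1,024-entry codebooks) and four tokens per frame are assumptions made purely for illustration.

```python
bits_per_second = 600    # stated codec rate
bits_per_token = 10      # assumed: log2 of a 1,024-entry codebook
tokens_per_second = bits_per_second // bits_per_token       # 60
tokens_per_frame = 4     # assumed depth of the per-frame token stack
frames_per_second = tokens_per_second // tokens_per_frame   # 15

# One frame's group: the first token carries phonetic/prosodic content,
# the remaining tokens add acoustic detail (values are arbitrary).
frame = {"phonetic_prosodic": 512, "acoustic_detail": [87, 903, 44]}

# Two minutes at this assumed rate is 60 * 120 = 7,200 tokens, consistent
# with the "over 5,000 tokens" figure quoted below.
```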

Even with the new audio codec, generating a two-minute dialogue requires producing over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently process hierarchies of information, matching the structure of our acoustic tokens.
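
One common way to exploit that frame structure, used by several hierarchical token models and offered here only in that spirit rather than as the authors' actual architecture, is to pair a causal "temporal" transformer that runs across frames with a small "depth" transformer that predicts the tokens inside each frame, so attention cost grows with the number of frames rather than the total number of tokens. A hedged sketch, with every name and size invented for the example:

```python
import torch
import torch.nn as nn

def causal_mask(n):
    # position i may only attend to positions <= i
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

class HierarchicalLM(nn.Module):
    """Temporal transformer over frames + depth transformer within a frame."""
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        mk = lambda: nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True)
        self.temporal = nn.TransformerEncoder(mk(), num_layers=4)  # across frames
        self.depth = nn.TransformerEncoder(mk(), num_layers=2)     # within frames
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):                        # ids: (B, frames, Q)
        B, F, Q = ids.shape
        frame_emb = self.tok(ids).sum(dim=2)       # (B, F, D) frame summaries
        # shift right so frame f's context sees only frames < f
        start = torch.zeros(B, 1, frame_emb.size(-1))
        ctx = self.temporal(torch.cat([start, frame_emb[:, :-1]], dim=1),
                            mask=causal_mask(F))   # (B, F, D)
        # within each frame, token i is predicted from ctx + tokens < i
        depth_in = torch.cat([ctx.unsqueeze(2),
                              self.tok(ids[..., :-1])], dim=2)  # (B, F, Q, D)
        out = self.depth(depth_in.reshape(B * F, Q, -1), mask=causal_mask(Q))
        return self.head(out).reshape(B, F, Q, -1)  # logits for every token

model = HierarchicalLM()
ids = torch.randint(0, 1024, (1, 300, 4))  # ~20 s at the assumed 15 frames/s
logits = model(ids)                        # trained with cross-entropy as usual
```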

Using this technique, the acoustic tokens for an entire dialogue can be generated efficiently within a single autoregressive inference pass. Once generated, these tokens are decoded back into an audio waveform using the audio codec.

An animation showing how the speech generation model autoregressively produces a stream of audio tokens, which are decoded back into a waveform of dialogue between two speakers.
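
Generation itself then reduces to a standard sampling loop over the acoustic-token vocabulary, with the codec's decoder as the final step. A minimal sketch, assuming a `model` like the flat language-model sketch above and a `codec` with a `decode` method like the earlier one; neither stands in for the production system.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids, num_new, temperature=0.9):
    """Sample acoustic tokens one at a time, then decode them to audio."""
    ids = prompt_ids                               # (1, T0) conditioning tokens
    for _ in range(num_new):
        logits = model(ids)[:, -1] / temperature   # next-token distribution
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        ids = torch.cat([ids, nxt], dim=1)
    return ids

# tokens = generate(model, script_tokens, num_new=5000)  # ~2 min of dialogue
# waveform = codec.decode(tokens)                        # tokens -> audio
```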

To teach the model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. We then fine-tuned it on a much smaller dataset of dialogue with high acoustic quality and precise speaker annotations, consisting of unscripted conversations from a number of voice actors, complete with the realistic disfluencies of real conversation, its "umm"s and "aah"s. This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone, and timing.
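
That two-stage recipe, pretraining on broad speech data and then fine-tuning on the small, carefully annotated dialogue set, can be expressed as two calls to the same training loop. Everything here is schematic: the dataset handles `generic_speech_tokens` and `annotated_dialogue_tokens`, the step counts, and the learning rates are invented stand-ins for unpublished details.

```python
import torch

def train(model, token_batches, steps, lr):
    """Standard next-token cross-entropy over codec tokens."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _, ids in zip(range(steps), token_batches):   # ids: (B, T)
        logits = model(ids[:, :-1])
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 1: pretrain on hundreds of thousands of hours of speech.
# train(model, generic_speech_tokens, steps=1_000_000, lr=3e-4)
# Stage 2: fine-tune on the small, speaker-annotated dialogue set.
# train(model, annotated_dialogue_tokens, steps=50_000, lr=3e-5)
```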

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we have incorporated our SynthID technology to watermark non-transient AI-generated audio content from these models, helping to safeguard against potential misuse of this technology.

A new speech experience awaits

We are now focused on improving the model’s fluency and acoustic quality, adding more fine-grained control over features such as prosody, and exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are vast, especially when combined with the Gemini family of models. From enhancing the learning experience to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice-based technology.

Acknowledgments

Authors of this work: Zalán Borsos, Matt Sharifi, Brian McWilliams, Yunpeng Li, Damien Vincent, Félix de Chaumont Quitry, Martin Sundermeyer, Eugene Kharitonov, Alex Tudor, Victor Ungureanu, Karolis Misiunas, Sertan Girgin, Jonas Rothfuss, Jake Walker, Marco Tagliasacchi.

We would like to thank Leland Rechis, Ralph Leith, Paul Middleton, Poly Pata, Minh Truong, and RJ Skerry-Ryan for their important work on interaction data.

We are very grateful to our collaborators at Labs, Illuminate, Cloud, Speech, and YouTube for their great work in incorporating these models into their products.

We would also like to thank Françoise Beaufays, Krishna Bharat, Tom Hume, Simon Okumine, and James Zhao for their guidance on the project.
