Meta

Introducing Voicebox: The Most Versatile AI for Speech Generation

Takeaways

  • Voicebox is a generative AI model that can help with audio editing, sampling and stylizing.
  • This type of technology could be used in the future to help creators easily edit audio tracks, allow visually impaired people to hear written messages from friends in their voices, and enable people to speak any foreign language in their own voice.

Today, we’re announcing a breakthrough in generative AI for speech. We’ve developed Voicebox, a state-of-the-art AI model that uses in-context learning to perform speech generation tasks it wasn’t specifically trained to do, such as editing, sampling and stylizing.

Voicebox can produce high-quality audio clips and edit pre-recorded audio, for example by removing car horns or a dog barking, all while preserving the content and style of the audio. The model is also multilingual and can produce speech in six languages.

In the future, multipurpose generative AI models like Voicebox could give natural-sounding voices to virtual assistants and non-player characters in the metaverse. They could allow visually impaired people to hear written messages from friends read by AI in their voices, give creators new tools to easily create and edit audio tracks for videos, and much more.

The versatility of Voicebox enables a variety of tasks, including:

In-context text-to-speech synthesis: Using an audio sample as short as two seconds long, Voicebox can match the audio style and use it for text-to-speech generation.

Speech editing and noise reduction: Voicebox can recreate a portion of speech that’s interrupted by noise or replace misspoken words without having to re-record an entire speech. For example, you can identify a segment of a speech that’s interrupted by a dog barking, crop it, and instruct Voicebox to regenerate that segment, like an eraser for audio editing.
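
The identify-and-crop part of that workflow can be sketched in a few lines of Python. This is only an illustration of preparing the input, assuming the audio is held as a plain list of float samples. The helper name `mask_segment` is hypothetical and is not part of any Voicebox release, and the regeneration step is left out because this post does not describe an interface for it.

```python
def mask_segment(samples, sample_rate, start_s, end_s):
    """Blank out the span [start_s, end_s) of an audio clip.

    Hypothetical helper for illustration only.
    Returns (masked_samples, mask), where mask[i] is True for every
    sample that was removed and would need to be regenerated.
    """
    if not 0 <= start_s < end_s:
        raise ValueError("need 0 <= start_s < end_s")

    # Convert times in seconds to sample indices, clamped to the clip length.
    start = min(int(round(start_s * sample_rate)), len(samples))
    end = min(int(round(end_s * sample_rate)), len(samples))

    mask = [start <= i < end for i in range(len(samples))]
    # Replace the noisy samples with silence; everything else is kept as-is.
    masked = [0.0 if removed else value for value, removed in zip(samples, mask)]
    return masked, mask


# Toy clip: one second of audio at 8 samples per second.
clip = [0.1] * 8
# Mark the quarter second starting at 0.25 s (say, the dog bark) for regeneration.
masked_clip, mask = mask_segment(clip, sample_rate=8, start_s=0.25, end_s=0.5)
```

The mask marks which samples a speech-infilling model would have to fill in, and the untouched samples on either side are the context it would need to match.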

Cross-lingual style transfer: When given a sample of someone’s speech and a passage of text in English, French, German, Spanish, Polish or Portuguese, Voicebox can produce a reading of the text in any of those languages, even when the sample speech and the text are in different languages. This capability could be used in the future to help people communicate in a natural, authentic way even if they don’t speak the same languages.

Diverse speech sampling: Having learned from diverse data, Voicebox can generate speech that is more representative of how people talk in the real world and in the six languages listed above.

Voicebox is an important step forward in our generative AI research, and we look forward to continuing our exploration in the audio space and seeing how other researchers build on our work.

Learn more about Voicebox.