Meta

Introducing the First Self-Supervised Algorithm for Speech, Vision and Text

Update on December 13, 2022 at 7:00 AM PT:

Today we’re announcing data2vec 2.0, a new algorithm that achieves the same accuracy as the most popular existing algorithm for computer vision but is 16x faster. It learns the same way in different modalities and also improves training efficiency for speech and text data.

We hope that general and efficient self-supervised learning algorithms such as data2vec 2.0 will lead to machines that can deeply understand extremely complex data, such as the contents of an entire movie.

To make our research accessible to other researchers, we’re also sharing the code and pretrained models.

Learn more about data2vec 2.0.

Originally published on January 20, 2022 at 9:00 AM PT:

Today, we’re announcing data2vec, the first high-performance self-supervised algorithm that learns the same way in multiple modalities, including speech, vision and text. Most machines learn exclusively from labeled data. However, through self-supervised learning, machines are able to learn about the world just by observing it and then figuring out the structure of images, speech or text. This is a more scalable and efficient approach for machines to tackle new complex tasks, such as understanding text for more spoken languages. 

Self-supervised learning algorithms for images, speech, text or other modalities function in very different ways, which has limited researchers’ ability to apply them more broadly. Because an algorithm designed for understanding images can’t be directly applied to reading text, it’s difficult to push several modalities ahead at the same rate. With data2vec, we’ve developed a unified way for models to predict their own representations of the input data, regardless of whether it’s speech, text or images. By focusing on these representations, a single algorithm can work with completely different types of input.
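To make the idea above concrete, here is a toy sketch of a data2vec-style objective. This is an illustrative assumption, not Meta’s released implementation: a “student” encoder sees a masked view of the input and is trained to regress the latent representations that a “teacher” encoder (a slowly updated copy of the student) produces for the full, unmasked input. All names, sizes, and the linear encoder are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 16    # feature dimension per token (illustrative)
SEQ = 8     # number of tokens: patches, speech frames, or word pieces
TAU = 0.999 # EMA decay for the teacher weights

# A toy linear "encoder": representation = x @ W.
student_W = rng.normal(scale=0.1, size=(DIM, DIM))
teacher_W = student_W.copy()  # teacher starts as a copy of the student

def encode(x, W):
    """Map token features to latent representations."""
    return x @ W

def ema_update(teacher_W, student_W, tau=TAU):
    """Teacher weights track the student via an exponential moving average."""
    return tau * teacher_W + (1.0 - tau) * student_W

# One batch of data: SEQ tokens of DIM features each.
x = rng.normal(size=(SEQ, DIM))

# Teacher builds targets from the full input; the student only sees a masked view.
targets = encode(x, teacher_W)

mask = rng.random(SEQ) < 0.5  # mask roughly half the tokens
x_masked = x.copy()
x_masked[mask] = 0.0          # crude stand-in for learned mask tokens

predictions = encode(x_masked, student_W)

# Regression loss on masked positions only: the student predicts the
# teacher's representations of the tokens it could not see.
loss = np.mean((predictions[mask] - targets[mask]) ** 2)

# After a gradient step on the student (omitted here), update the teacher.
teacher_W = ema_update(teacher_W, student_W)

print(f"masked tokens: {int(mask.sum())}, loss: {loss:.4f}")
```

Because the target is a continuous representation rather than a pixel, waveform, or word, the same regression objective applies unchanged whether the tokens come from an image, an utterance, or a sentence, which is what lets one algorithm serve all three modalities.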

With data2vec, we’re closer to building machines that learn about different aspects of the world around them without having to rely on labeled data. This paves the way for more general self-supervised learning and brings us closer to a world where AI might use videos, articles, and audio recordings to learn about complicated subjects, such as the game of soccer or different ways to bake bread. Data2vec will also enable us to develop more adaptable AI, which we believe will be able to perform tasks beyond what’s possible today. 

If you’re a researcher interested in building upon our work, you can access the open source code and pretrained models on GitHub.

Learn more about data2vec.