For more than a decade, Meta’s Fundamental AI Research (FAIR) team has focused on advancing the state of the art in AI through open research. As the field rapidly innovates, we believe that collaboration with the global AI community is more important than ever.
Today, we’re excited to share some of the most recent FAIR research models with the global community. We’re publicly releasing five models, including image-to-text and text-to-music generation models, a multi-token prediction model and a technique for detecting AI-generated speech. By publicly sharing this research, we hope to inspire iterations and ultimately help advance AI in a responsible way.
Meta Chameleon Can Process and Generate Both Text and Images
We are publicly releasing key components of our Chameleon models under a research-only license. Chameleon is a family of mixed-modal models that can understand and generate both images and text. Just as humans can process words and images simultaneously, Chameleon can process and deliver both images and text at the same time. While most large language models produce unimodal output (turning text into images, for example), Chameleon can take any combination of text and images as input and output any combination of text and images. And the possibilities with Chameleon are endless: imagine generating creative captions for images or using a mix of text prompts and images to create an entirely new scene.
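To illustrate the mixed-modal idea, the sketch below shows how an interleaved text-and-image prompt can be flattened into a single token sequence that one autoregressive model reads and extends. The tokenizer and marker names are hypothetical stand-ins for illustration, not Chameleon’s released interface.

```python
from typing import List, Union

# Hypothetical stand-ins for illustration only: a text tokenizer, an image tokenizer
# that maps an image to discrete codes, and begin/end-of-image marker ids. None of
# these names come from the released Chameleon code.
def encode_mixed_prompt(parts: List[Union[str, object]],
                        text_tokenizer, image_tokenizer,
                        boi_id: int, eoi_id: int) -> List[int]:
    """Flatten an interleaved text/image prompt into one token sequence."""
    tokens: List[int] = []
    for part in parts:
        if isinstance(part, str):
            tokens.extend(text_tokenizer.encode(part))
        else:
            # Images become discrete codes wrapped in begin/end-of-image markers,
            # so a single autoregressive transformer can attend over both modalities
            # and emit either kind of token when generating.
            tokens.append(boi_id)
            tokens.extend(image_tokenizer.encode(part))
            tokens.append(eoi_id)
    return tokens
```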
Multi-Token Prediction Helps Train AI Models to Predict Words Faster
Trained on large amounts of text, large language models (LLMs) are already helping people generate creative text, brainstorm ideas and answer questions. LLMs have a simple training objective: predicting the next word. While this approach is simple and scalable, it’s also inefficient: it requires several orders of magnitude more text than children need to reach the same degree of language fluency.
In April, we proposed a new approach for building better and faster LLMs through multi-token prediction. With this approach, we train language models to predict multiple future words at once, instead of one at a time. In the spirit of responsible open science, we are releasing the pretrained models for code completion under a non-commercial, research-only license.
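To make the objective concrete, here is a minimal PyTorch sketch of multi-token prediction with a shared trunk and one output head per future position. The backbone, head count and loss weighting are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared trunk with one output head per future position (sketch only)."""

    def __init__(self, backbone: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.backbone = backbone  # shared trunk; assumed to return (batch, seq_len, d_model)
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )
        self.n_future = n_future

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer ids
        hidden = self.backbone(tokens)
        loss = torch.zeros((), device=tokens.device)
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])   # head k predicts the token k steps ahead
            targets = tokens[:, k:]         # targets shifted by k positions
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return loss / self.n_future         # average the per-head losses
```

At inference time, the next-token head alone reproduces ordinary autoregressive decoding, while the extra heads can be used to draft several tokens per step, which is one way such models can generate text faster.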
JASCO Offers More Control Over AI Music Generation
Generative AI has enabled people to explore their creativity in new ways, such as by turning a text prompt into a clip of music. While existing text-to-music models like MusicGen rely mainly on text inputs for music generation, our new model, JASCO, is capable of accepting various inputs, such as chords or beat, to improve control over generated music outputs.
This allows both symbolic and audio inputs to be incorporated in the same text-to-music generation model.
Results suggest that JASCO is comparable to the evaluated baselines in generation quality, while allowing significantly better and more versatile control over the generated music.
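As a toy illustration of mixing symbolic and audio conditions, the sketch below renders a chord progression into a frame-aligned tensor that a generator could consume alongside a text prompt and an audio (for example, drum) condition. The frame rate and chord vocabulary here are assumptions for illustration, not JASCO’s actual conditioning pipeline.

```python
import torch

# Toy sketch: render a symbolic chord progression into a frame-aligned tensor that a
# text-to-music generator could consume alongside the text prompt and an audio (drum)
# condition. The 50 Hz frame rate and the chord vocabulary are assumptions, not JASCO's.
FRAME_RATE = 50
CHORD_IDS = {"C": 1, "Am": 2, "F": 3, "G": 4}  # 0 is reserved for "no chord"

def chords_to_frames(chords, duration_s):
    """chords: list of (label, onset_seconds), assumed sorted by onset time."""
    n_frames = int(duration_s * FRAME_RATE)
    frames = torch.zeros(n_frames, dtype=torch.long)
    onsets = [onset for _, onset in chords] + [duration_s]
    for (label, _), start_s, end_s in zip(chords, onsets, onsets[1:]):
        frames[int(start_s * FRAME_RATE):int(end_s * FRAME_RATE)] = CHORD_IDS[label]
    return frames  # shape: (n_frames,), one chord id per conditioning frame

# "C" for the first 2 seconds, then "Am", "F" and "G", over an 8-second clip.
chord_frames = chords_to_frames([("C", 0.0), ("Am", 2.0), ("F", 4.0), ("G", 6.0)], duration_s=8.0)
```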
AudioSeal Helps Detect AI-Generated Speech
We are also releasing AudioSeal, which we believe is the first audio watermarking technique designed specifically for the localized detection of AI-generated speech. AudioSeal makes it possible to pinpoint AI-generated segments within a longer audio snippet.
Unlike traditional methods that rely on complex decoding algorithms, AudioSeal’s localized detection approach allows for faster and more efficient detection. This design enhances the detection speed by up to 485 times compared to previous methods, making it suitable for large-scale and real-time applications.
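The localized-detection idea can be illustrated with a toy sketch: assume a detector that assigns every audio frame a probability of carrying the watermark, then threshold those scores to recover the AI-generated spans. The function below is illustrative only and is not the AudioSeal implementation.

```python
import torch

# Toy illustration of localized detection (not the AudioSeal implementation): a detector
# is assumed to score every audio frame with the probability that it carries the
# watermark; thresholding those scores localizes the AI-generated spans.
def localize_segments(frame_scores: torch.Tensor, frame_rate: int, threshold: float = 0.5):
    """frame_scores: (n_frames,) probabilities in [0, 1]. Returns (start_s, end_s) spans."""
    flagged = (frame_scores > threshold).tolist()
    segments, start = [], None
    for i, hit in enumerate(flagged):
        if hit and start is None:
            start = i                               # a watermarked span begins
        elif not hit and start is not None:
            segments.append((start / frame_rate, i / frame_rate))
            start = None                            # the span ends
    if start is not None:
        segments.append((start / frame_rate, len(flagged) / frame_rate))
    return segments

# Example: a 6-frame clip at 2 frames per second where only the middle frames are flagged.
print(localize_segments(torch.tensor([0.1, 0.2, 0.9, 0.95, 0.2, 0.1]), frame_rate=2))
# -> [(1.0, 2.0)]
```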
AudioSeal is being released under a commercial license. It’s just one of several lines of responsible research we have shared to help prevent the misuse of generative AI tools.
Increasing Diversity in Text-To-Image Generation Systems
It’s important that text-to-image models work well for everyone and reflect the geographical and cultural diversity of the world. To achieve this, we developed automatic indicators to evaluate potential geographical disparities in text-to-image models.
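As a simplified illustration of what such an indicator might look like (the released evaluation code may differ), the sketch below aggregates an automatic score for generated images by the region named in the prompt and reports the gap between the best- and worst-served regions.

```python
from statistics import mean

# Toy sketch of a geographic-disparity indicator (not the released evaluation code):
# given an automatic quality or consistency score for each generated image, grouped by
# the region named in the prompt, report the gap between the best- and worst-served regions.
def disparity_gap(scores_by_region):
    """scores_by_region: dict of region -> list of per-image scores in [0, 1]."""
    region_means = {region: mean(scores) for region, scores in scores_by_region.items()}
    return max(region_means.values()) - min(region_means.values())

example = {
    "Western Europe": [0.81, 0.78, 0.84],
    "West Africa": [0.55, 0.61, 0.58],
    "Southeast Asia": [0.70, 0.66, 0.72],
}
print(disparity_gap(example))  # ~0.23 here; 0.0 would indicate no measured regional gap
```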
In addition, to understand how people in different regions vary in their perceptions of geographic representation, we conducted a large-scale annotation study. We collected more than 65,000 annotations and more than 20 survey responses per example, covering appeal, similarity, consistency and shared recommendations for improved automatic and human evaluations of text-to-image models. This enables more diversity and better representation in AI-generated images.
Today, we’re releasing geographic disparities evaluation code and our annotations, which we hope will help the community improve diversity across their generative models.