Inside Facebook Reality Labs Research: The Future of Audio

The audio team at Facebook Reality Labs Research is working on novel technologies to enable both audio presence and perceptual superpowers, letting us hear better in noisy environments with our future AR glasses. In the latest entry of our Inside Facebook Reality Labs series, we take you behind the scenes with the team for an in-depth look at their current demos and prototype technology.

As Arthur C. Clarke famously wrote, “Any sufficiently advanced technology is indistinguishable from magic.” I recently had the opportunity to witness some of the finest prestidigitation that Facebook has to offer — and I’m here to take you along for the ride.

But first, a little background.

Whether it’s the comfort of a loved one’s voice or the passion of a song lyric, sound contains an emotional richness unlike any other sensory experience. Yet too often, that experience is drowned out by noise, degraded by distance, or lost through the limitations of our own hearing abilities.

It doesn’t have to be that way. Imagine putting on a VR headset or a pair of AR glasses and being transported thousands of miles away to attend class, go to work, or attend a relative’s birthday party — as if you were there in real life. This experience is known as “social presence.” Today’s technology falls short on that promise, in part because of unrealistic sound. How many times have you had to repeat yourself because of a noisy background, or lost track of a conversation because you couldn’t tell who was saying what?

Even when we’re in the same geographic location, the type of environment affects the quality of human connection. Noisy backgrounds get in the way, often causing us to stay quiet, get frustrated, or end up losing our voice from all the shouting. Now imagine that same pair of AR glasses takes your hearing abilities to an entirely new level and lets you hear better in noisy places, like restaurants, coffee shops, and concerts. What would this do for the quality of your in-person interactions?

At Facebook Reality Labs Research, we’re building the future of augmented reality (AR) and virtual reality (VR). FRL Research has brought together a highly interdisciplinary audio team made up of research scientists, engineers, designers, and more, all striving to improve human communication through radical audio innovation. The mission of the team is twofold: to create virtual sounds that are perceptually indistinguishable from reality and to redefine human hearing. To do that, the team focuses on delivering two new capabilities: first, audio presence — the feeling that the source of a virtual sound is physically present in the same space as the listener, with such high fidelity that it’s indistinguishable from a real-world source; and second, perceptual superpowers — technological advancements that let us hear better in noisy environments by turning up the volume on the person sitting across from us while turning down the volume on unwanted background noises.

Research Scientist Manager Ravish Mehra

The FRL Research audio team, led by Ravish Mehra, is one of the largest audio research teams in the world and covers a diverse range of interconnected research problems. Having grown from a single person to a team of world-class experts in the span of just six years, it works on solving novel research problems, generating proof-of-concept solutions, and proving those solutions out via compelling experiences. I was able to participate in some of those experiences, and the implications for the future of audio communication are astounding. This is the story of the future of communication, which will require inventing an entirely new stack of hardware and software technologies that deliver authentic, embodied experiences.

Hearing Is Believing: Audio Presence

While he wanted to grow up to be a rock star, Research Scientist Pablo Hoffmann is closer to a magician today. He’s successfully developed an always-on audio calibration system that effectively lets you hear sounds in ultra-high fidelity through a pair of headphones with virtually no coloration from the hardware. This demo uses FRL Research’s novel algorithm and software processing technologies and off-the-shelf hardware to illustrate the experience of personalizing audio, recreating a room’s acoustics, and making a hardware device acoustically transparent.

I’m sitting at his desk in Redmond, Washington, when he hands me a pair of headphones with microphones specially placed at the entrance of my ears. For the next two minutes, the microphones record the sounds of the room from my perspective. Hoffmann talks loudly and softly from different spots; he plays the guitar and even drops his keys behind me at one point.

Research Scientist Pablo Hoffmann

Hoffmann then plays back the audio recording as I listen over the headphones. It’s so realistic, it’s virtually indistinguishable from the real thing. In fact, sitting next to him at his desk, I’d bet money that he’s talking to me as I see him in my peripheral vision. But when I look at him, I can see that his lips aren’t moving — the sound I hear coming from his direction is entirely synthetic. It’s a two-minute-long déjà vu.

This is what it means to create virtual sounds that are perceptually indistinguishable from reality. And when you see that work in action, it’s akin to a benevolent kind of sorcery.

“‘Perceptually indistinguishable’ is an easy thing to say,” explains Research Lead Philip Robinson. “But when you hear it, it’s magic.”

The Ingredients of Realistic Audio

When someone speaks to you in a room, one of your ears hears the sound before the other. The volume is also different in each ear. On top of that, the shape of the ear changes how each of us hears sound ever so slightly. All of these signals tell your brain where a sound comes from. And sound interacts with your environment, bouncing off the walls before making it to your ear. These are the core components that, if reproduced accurately, let virtual sounds replicate real ones.
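
To put a number on the first of those cues, here is a minimal sketch of the arrival-time difference between the two ears. It assumes Woodworth's classic spherical-head approximation, an average head radius of 8.75 cm, and a distant source; the function name and constants are illustrative choices and are not taken from FRL Research's software.

```python
import math

HEAD_RADIUS_M = 0.0875        # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0    # air at roughly 20 degrees Celsius


def interaural_time_difference(azimuth_deg: float) -> float:
    """Approximate arrival-time difference between the ears, in seconds.

    Uses Woodworth's spherical-head formula for a distant source.
    azimuth_deg runs from 0 (straight ahead) to 90 (directly to one side).
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth_deg must be between 0 and 90")
    theta = math.radians(azimuth_deg)
    return HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (theta + math.sin(theta))


if __name__ == "__main__":
    for angle in (0, 30, 60, 90):
        microseconds = interaural_time_difference(angle) * 1e6
        print(f"{angle:>2} degrees: {microseconds:.0f} microseconds")
```

Under these assumptions, a sound directly to one side reaches the far ear roughly two thirds of a millisecond late. The volume difference and the filtering done by the outer ear depend on frequency and on each person's anatomy, so they have no comparably simple formula.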

In 2017, the audio research team helped ship spatialized audio — virtual sounds that mimic the directions sounds come from in real life. It also invented high-quality acoustic simulation technologies that make virtual environments even more believable. These technologies pushed forward the state of the art of spatial audio and power many of today’s experiences on Oculus Quest and the Rift Platform, including First Steps and Oculus First Contact. The next frontier is personalizing spatial audio and modeling how sound interacts with real environments. During the next two stops of my tour of the Redmond lab, the team shows me their progress on both fronts.

Personalizing Spatial Audio

A researcher leads me inside an anechoic chamber — a multi-million-dollar facility suspended on springs and separated from the surrounding building by a three-foot-wide gap of air and four-inch-thick steel panels on all sides that absorb all echo. The room is so quiet, you can literally hear your own heartbeat. A mechanical arm with 54 loudspeakers from top to bottom rotates freely in a 360° arc, playing tones in order to measure how the sound reacts to the unique geometry of my ears. The entire process takes about half an hour, and in the end, I can see what amounts to a digital representation of my personal experience of hearing spatialized audio — also known as a head-related transfer function (HRTF). The current at-scale solution used in computer gaming and VR is a “generic,” one-size-fits-all HRTF, which doesn’t provide perfect spatial accuracy for everyone. It’s like seeing through a fogged-up car windshield. Personalized HRTF measurement overcomes that limitation and allows everyone to truly hear virtual sounds the same way they individually perceive real sounds, like looking through perfectly clear glass.
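
In signal-processing terms, an HRTF is usually applied as a pair of filters, one per ear. The sketch below shows the idea with plain convolution. The impulse responses are made-up toy values standing in for a measured HRTF, and the function names are hypothetical rather than anything from the team's tools.

```python
def convolve(signal, kernel):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono signal with a left and a right impulse response."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses for a source on the listener's left (assumed values):
# the left ear hears it immediately at full level, the right ear hears it
# two samples later at half level.
TOY_HRIR_LEFT = [1.0, 0.0, 0.0]
TOY_HRIR_RIGHT = [0.0, 0.0, 0.5]

left, right = render_binaural([1.0, 0.5], TOY_HRIR_LEFT, TOY_HRIR_RIGHT)

if __name__ == "__main__":
    print("left ear: ", left)
    print("right ear:", right)
```

A real HRTF has hundreds of taps per ear and a different pair for every direction, which is why a generic set fits some listeners better than others.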

While anechoic chambers clearly aren’t a scalable method of capturing unique HRTFs, the audio research team is considering several novel approaches. As just one example, they hope to one day develop an algorithm that can approximate a workable personalized HRTF from something as simple as a photograph of a person’s ears.

Modeling a Room’s Acoustics

Understanding how sound travels through a particular space — and bounces off its surfaces before reaching the ear — is another powerful tool for making virtual sounds replicate real ones. Just as visual AR uses simultaneous localization and mapping (SLAM) to get the geometry and lighting right for virtual objects, on the sound front we need to understand the room’s acoustic properties to seamlessly place a virtual sound source into the real space. For my personal masterclass in room acoustics, the team invites me to play a game, trying to determine which sounds are coming from a series of physical speakers set up in the room around me and which are coming from a set of open-ear headphones I’m wearing. I can move around the space and hear the sounds respond accordingly. I consider myself a bit of an audiophile, but my attempts at distinguishing which sounds are real and which are virtual top out at about 50-50. Although some of the audio is coming from headphones, the spatialization and simulated acoustics are so realistic that my brain is fully convinced that the sounds I’m hearing are coming from the speakers in the room. I even have to pull the headphones off to confirm where the sounds are really coming from.

“Imagine if you were on a phone call and you forgot that you were separated by distance,” says Robinson. “That’s the promise of the technology we’re developing.”

To get a sense of what’s at stake here, the team shows me a demo that illustrates telepresence, the ability to feel present in a location other than your own, in real time. I sit in a room wearing a modified Oculus Rift headset and a pair of headphones, but it feels like I’m someplace else, sitting around a table with a number of researchers and colleagues. I can see the meeting room via my headset. An array of 32 microphones captures the sounds in the meeting room and delivers spatialized audio directly to my headphones so that each person’s voice sounds like it’s coming from their specific location around the table. I find myself naturally turning to face the direction of each person. This helps me follow and participate in the conversation and feel like I’m in the room itself — even though I’m actually not.

This could be a game changer for video calls with friends, family, or coworkers at a distance. With a phone call today, the other person’s voice sounds like it’s coming from the phone itself (or from the center of your head, if you’re wearing earbuds), so your brain rejects the idea that the other person might be in the same location as you. Spatial audio mimics the directions that sounds come from in real life and the environmental acoustics, so you more fully experience social presence.

Spatial audio, when combined with Codec Avatars (ultra-realistic representations of people that can be animated in real time), hyper-realistic 3D reconstruction, full body tracking, shared virtual spaces, and more, will allow us to crack true social presence. By letting you spend time with the important people in your life in meaningful places, we can radically transform how you live, work, and play.

“I take to heart the overall Facebook mission, which is really about connecting people,” says Robinson. “The only reason we need for virtual sound to be made real is so that I can put a virtual person in front of me and have a social interaction with them that’s as if they were really there. And remote or in person, if we can improve communication even a little bit, it would really enable deeper and more impactful social connections.”

As mind-blowing as truly spatialized audio and realistic room acoustics can be, they only touch upon the first part of the FRL Research audio team’s mission. “As we started doing this research in VR and as that morphed into AR, we realized that all of the technologies that we’re building here can serve a higher purpose, which is to improve human hearing,” explains Mehra.

AR Glasses and Perceptual Superpowers

The second part of the FRL Research audio team’s mission — to redefine human hearing — is an ambitious goal, to be sure. But it’s also directly connected to Facebook’s work to deliver AR glasses.

“Human hearing is an amazing sense that allows us to connect through spoken language and musical expression,” explains Tony Miller, who leads hardware research for the team.

“At FRL Research, we are exploring new technologies that can extend, protect, and enhance your hearing ability — giving you the ability to increase concentration and focus, while allowing you to seamlessly interact with the people and information you care about. At the heart of this work is a focus on building hardware that is deeply rooted in auditory perception and augmented by the latest developments in signal processing and artificial intelligence.”

Imagine being able to hold a conversation in a crowded restaurant or bar without having to raise your voice to be heard or straining to understand what others are saying. By using multiple microphones on your glasses, we can capture the sounds around you. Then by using the pattern of your head and eye movements, we can figure out which of these sounds you’re most interested in hearing, without requiring you to robotically stare at it. This lets us enhance the right sounds for you and dim others, making sure that what you really want to hear is clear, even in loud background noise.
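
As a deliberately simplified illustration of that last step, the sketch below boosts whichever sound source lies closest to the direction of the wearer's gaze and turns the others down. Everything in it is an assumption made for the example: the function name, the gain values, and the nearest-angle rule itself, which is far cruder than inferring interest from patterns of head and eye movement.

```python
def gaze_weighted_gains(source_azimuths_deg, gaze_azimuth_deg,
                        boost=2.0, cut=0.5):
    """Return one gain per source: boost the source nearest the gaze, cut the rest.

    Angles are in degrees, measured around the listener; values wrap at 360.
    Raises ValueError if there are no sources.
    """
    def angular_distance(a, b):
        # Shortest distance between two angles on a circle, in degrees.
        return abs((a - b + 180.0) % 360.0 - 180.0)

    nearest = min(
        range(len(source_azimuths_deg)),
        key=lambda i: angular_distance(source_azimuths_deg[i], gaze_azimuth_deg),
    )
    return [boost if i == nearest else cut
            for i in range(len(source_azimuths_deg))]


if __name__ == "__main__":
    # Hypothetical scene: companion straight ahead, TV at 60 degrees,
    # kitchen noise at 270 degrees.
    scene = [0.0, 60.0, 270.0]
    print(gaze_weighted_gains(scene, gaze_azimuth_deg=5.0))
    print(gaze_weighted_gains(scene, gaze_azimuth_deg=55.0))
```

A practical system would smooth these gains over time so that a brief glance away does not abruptly mute the person you are talking to.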

What You See Is What You Hear

To experience this, I sit across the table from Research Scientist Owen Brimijoin in a room that simulates a restaurant. I’m wearing headphones and an off-the-shelf eye movement tracking device. Eye movement tracking is one of several solutions we’re exploring to understand what a person wants to hear. As Brimijoin begins speaking, the team brings up the background noise level. To my surprise, I can still hear him easily and converse naturally. And when I look at the TV in the corner, the commercial it’s playing magically gets louder as other sounds get quieter. When he starts speaking again, I can understand him perfectly and our conversation resumes. As with Hoffmann’s demo, the demo pairs FRL Research’s software with off-the-shelf hardware to illustrate the experience of having enhanced hearing.

FRL Research audio demo

Loud restaurants aren’t just annoying — they can also pose a potential health risk for employees. In fact, prolonged exposure to noise levels above 85 decibels — which many restaurants and bars surpass these days — can contribute to hearing loss. By dimming the noise, we may be able to help protect people’s hearing over time.
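
The 85-decibel figure lines up with the recommended exposure limit published by the US National Institute for Occupational Safety and Health (NIOSH): 85 dBA averaged over an eight-hour day, with the recommended time halving for every additional 3 dB. Assuming that rule as the reference (our choice for this example, with an illustrative function name), the arithmetic looks like this:

```python
REFERENCE_LEVEL_DBA = 85.0   # NIOSH recommended exposure limit
REFERENCE_HOURS = 8.0        # duration allowed at the reference level
EXCHANGE_RATE_DB = 3.0       # each extra 3 dB halves the recommended time


def recommended_exposure_hours(level_dba: float) -> float:
    """Recommended daily exposure time, in hours, at a steady noise level."""
    halvings = (level_dba - REFERENCE_LEVEL_DBA) / EXCHANGE_RATE_DB
    return REFERENCE_HOURS / (2.0 ** halvings)


if __name__ == "__main__":
    for level in (85, 88, 94, 100):
        print(f"{level} dBA: {recommended_exposure_hours(level):.2f} hours")
```

Under this rule, a steady 100 dBA uses up the daily allowance in about 15 minutes, so even a modest reduction in level buys a lot of safe listening time.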

Novel Input: Capturing Sound

Next, the team shows me an innovative use of a technology called near-field beamforming that again makes me feel like I’m witnessing a magic trick — but this time they use custom hardware developed at FRL Research. Research Scientist Vladimir Tourbabin wears a simple pair of 3D-printed glasses with a special microphone array — an input prototype. Two physical speakers in the room play music at full volume. I’m in another room where Tourbabin calls me. I pick up, and he begins to read an online article in a normal speaking voice, which is easily drowned out by the noise in his cacophonous room.

Tourbabin then flips a switch — and I can suddenly hear his voice coming through, clear as day. It’s as if someone’s turned down the volume of the background noise so I can easily focus in on what he’s trying to say. It’s like getting a call from a friend at a rock concert or in a subway station, yet somehow I can hear his voice clearly and understandably. And it’s made possible by a seemingly simple set of microphones placed ever-so-opportunely on his plastic glasses frames that isolate his voice from the noise around him. You can imagine a future in which this technology could also let me speak to my AI Assistant quietly even in a noisy room, giving me more privacy and security, and preventing people nearby from being mistakenly picked up by my assistant or a call.
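
For readers curious about the principle, here is a minimal sketch of delay-and-sum beamforming, the simplest member of the beamforming family and not the near-field method used in the prototype. It assumes two microphones, whole-sample delays that are already known, and a contrived noise signal chosen so the cancellation is exact; all names are hypothetical.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel on the target, then average them.

    channels: list of sample lists, one per microphone.
    delays:   how many samples later the target reaches each microphone.
    """
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(usable)
    ]


# Contrived scene: the voice reaches microphone B one sample after
# microphone A, while an alternating noise tone hits both at the same time.
voice = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
noise = [0.5 if n % 2 == 0 else -0.5 for n in range(7)]

mic_a = [voice[n] + noise[n] for n in range(6)]
mic_b = [(voice[n - 1] if n >= 1 else 0.0) + noise[n] for n in range(7)]

enhanced = delay_and_sum([mic_a, mic_b], delays=[0, 1])

if __name__ == "__main__":
    print("microphone A:", mic_a)
    print("enhanced:    ", enhanced)
```

Lining the channels up on the voice makes the voice add coherently while the noise, now out of step between the two channels, averages away. Real rooms are never this tidy, which is why practical arrays use more microphones, fractional delays, and adaptive weighting.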

Output: Controlling Volume

The audio team’s aim is to cover the full range of sounds humans can hear, from 20 to 20,000 Hz. We’re currently developing special in-ear monitors (IEMs) — an output prototype — that will let us use active noise cancellation technologies to effectively turn down the volume on unwanted background noise, helping people to hear more clearly and safely in noisy environments. And when we pair this with FRL Research’s input prototype — including the microphone array — we can deliver the full experience of auditory superpowers.

“Our IEMs also feature perceptually transparent hear-through,” explains Audio Experiences Lead Scott Selfon, “making it sound like I have nothing in my ears, and letting me safely hear the entire world around me” — similar to Hoffmann’s earlier demo, only this time using a tiny earpiece.

This is that magic made real.

Improving Lives

The possibilities for this research are immense. While the majority of our perceptual superpowers research is focused on transforming communication for everyone, everywhere, we believe some of it could also inform new work in the area of hearing sciences. According to Johns Hopkins, roughly one in five people in the US have hearing loss. Many of them don’t use hearing aids for a variety of reasons including expense, social stigma, discomfort, and lack of reliability.

Recently, the team welcomed Thomas Lunner, a renowned hearing scientist whose work formed the basis for the world’s first digital hearing aid in 1995 and who will explore this research path further. “By putting hearing impaired people on par with people with normal hearing, we could help them become more socially engaged,” he says. “This resonates very well with Facebook’s mission in the sense that hearing loss often keeps people away from social situations.”

“I’ve been wearing hearing aids since I was a little girl,” adds Technical Program Manager Amanda Barry.

“The ability to help people stay connected with their families as they get older and their hearing fades — that’s really pretty exciting.”

Technical Program Manager Amanda Barry

Hearing sciences is an area that we’re starting to explore separately from our work on AR glasses. It has unique challenges that we think deserve attention, and we hope to help push the science forward. We’ll share more as the research progresses.

Designing With Integrity and Privacy in Mind

For smart AR glasses to be successful, we need to develop the technologies in play thoughtfully and responsibly. Although we’re still in the early stages of this research, we’ve started exploring ways to ensure user privacy and safety. And as we work to enhance people’s experience of sound, we must remain cognizant and respectful of social norms.

“The goal is to put guardrails around our innovation to do it responsibly, so we’re already thinking about potential safeguards we can put in place,” notes Mehra. “For example, before I can enhance someone’s voice, there could be a protocol in place that my glasses can follow to ask someone else’s glasses for permission.”

Another issue the team is keenly aware of is the capture of sensitive ear data, both in the research phase and beyond. Today, before any data we collect is made available to researchers, it is encrypted and the research participant’s identity is separated from the data such that it is unknown to the researchers using the data. Once collected, it’s stored on secure internal servers that are accessible only to a small number of researchers with express permission to use it. The team also has regular reviews with privacy, security, and IT experts to make sure they’re following protocol and implementing the appropriate safeguards.

“Deepfakes” — video and audio that use AI and pre-existing footage to fabricate a scene, such as a person saying something they never actually said in real life — are another issue we’re thinking through. For example, we’re discussing building robust identity verification technology — such as facial analysis — into headsets and glasses to ensure only you can access the avatar your voice is tied to, from your own device.

“Obviously we’re a ways away from this type of technology in glasses and headsets, but we want to collectively think about both the implications of these technologies and potential solutions with the broader society,” says Mehra.

“It’s also one of the reasons we’re discussing this research now. We’re committed to doing it out in the open so there can be a public discourse about the acceptable uses of this technology.”

What if you could easily hear the person you were talking to, regardless of background noise or distance? If you didn’t have to miss out on special events due to travel? If you could replace your high-end stereo, television, cell phone, and more with a single wearable device?

That’s a future we believe in — and we’re working to make it a reality.

The Next Frontier: Auditory Machine Perception

Ultimately, one of our main goals is to deliver stylish AR glasses that can understand not only the visual but also the acoustic world around you and use that knowledge and context to help you navigate your way around the world. To do that, we’ll tap into LiveMaps — a virtual map with shared and private components. An understanding of the acoustic soundscape can add information to that map, so that AI can improve your audio experience while assisting you in other useful ways. When you walk into a restaurant, for example, your AR glasses would be able to recognize different types of events happening around you: people having conversations, the air conditioning noise, dishes and silverware clanking. Then using contextualized AI, your AR glasses would be able to make smart decisions, like removing the distracting background noise — and you’d be no more aware of the assistance than of a prescription improving your vision.

“There’s also an opportunity for our AR glasses to not only help us hear better, but also help us understand better,” adds Selfon.

“If I can’t follow a conversation because of background noise or a language barrier, we can use contextualized AI and speech recognition to help me with real-time visual transcriptions or translation. And unlike an Assistant that’s on your counter at home, the AI Assistant coming with you would have full contextual awareness, so it can automatically raise its voice when you’re in a noisy environment or speak softly when you’re in a quiet place like a library.”

This is another area we’re just starting to explore, and we’ll share more in the future.

“The moment we’re in is such a crucial moment in the history of AR/VR technology,” says Mehra. “Five or 10 years from now, if somebody joins the field, they’ll be chasing tail lights. At this moment, we can actually define what the future can be. We can make things so real that you don’t need to travel hundreds or thousands of miles to attend a meeting or connect with your loved ones. We can build technologies that can be used to improve human hearing. If you’re passionate about that perspective, then this is the place to be and this is the time to make it happen.”