Teaching AI to communicate sounds like humans do

Inspired by the mechanics of the human vocal tract, a new AI model can produce and understand vocal imitations of everyday sounds. The method could help build new sonic interfaces for entertainment and education.

Whether you’re describing the sound of your faulty car engine or meowing like your neighbor’s cat, imitating sounds with your voice can be a helpful way to relay a concept when words don’t do the trick.

Vocal imitation is the sonic equivalent of doodling a quick picture to communicate something you saw — except that instead of using a pencil to illustrate an image, you use your vocal tract to express a sound. This might seem difficult, but it’s something we all do intuitively: To experience it for yourself, try using your voice to mirror the sound of an ambulance siren, a crow, or a bell being struck.

Inspired by the cognitive science of how we communicate, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have developed an AI system that can produce human-like vocal imitations with no training, and without ever having "heard" a human vocal impression before.

To achieve this, the researchers engineered their system to produce and interpret sounds much like we do. They started by building a model of the human vocal tract that simulates how vibrations from the voice box are shaped by the throat, tongue, and lips. Then, they used a cognitively-inspired AI algorithm to control this vocal tract model and make it produce imitations, taking into consideration the context-specific ways that humans choose to communicate sound.

The model can effectively take many sounds from the world and generate a human-like imitation of them — including noises like leaves rustling, a snake’s hiss, and an approaching ambulance siren. Their model can also be run in reverse to guess real-world sounds from human vocal imitations, similar to how some computer vision systems can retrieve high-quality images based on sketches. For instance, the model can correctly distinguish the sound of a human imitating a cat’s “meow” versus its “hiss.”

In the future, this model could potentially lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even methods to help students learn new languages.

The co-lead authors — MIT CSAIL PhD students Kartik Chandra SM ’23 and Karima Ma, and undergraduate researcher Matthew Caren — note that computer graphics researchers have long recognized that realism is rarely the ultimate goal of visual expression. For example, an abstract painting or a child’s crayon doodle can be just as expressive as a photograph.

“Over the past few decades, advances in sketching algorithms have led to new tools for artists, advances in AI and computer vision, and even a deeper understanding of human cognition,” notes Chandra. “In the same way that a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-phono-realistic ways humans express the sounds they hear. This teaches us about the process of auditory abstraction.”

The art of imitation, in three parts

The team developed three increasingly nuanced versions of the model to compare to human vocal imitations. First, they created a baseline model that simply aimed to generate imitations that were as similar to real-world sounds as possible — but this model didn’t match human behavior very well.

The researchers then designed a second “communicative” model. According to Caren, this model considers what’s distinctive about a sound to a listener. For instance, you’d likely imitate the sound of a motorboat by mimicking the rumble of its engine, since that’s its most distinctive auditory feature, even if it’s not the loudest aspect of the sound (compared to, say, the water splashing). This second model created imitations that were better than the baseline, but the team wanted to improve it even more.

To take their method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different based on the amount of effort you put into them. It costs time and energy to produce sounds that are perfectly accurate,” says Chandra. The researchers’ full model accounts for this by trying to avoid utterances that are very rapid, loud, or high- or low-pitched, which people are less likely to use in a conversation. The result: more human-like imitations that closely match many of the decisions that humans make when imitating the same sounds.

After building this model, the team conducted a behavioral experiment to see whether the AI- or human-generated vocal imitations were perceived as better by human judges. Notably, participants in the experiment favored the AI model 25 percent of the time in general, and as much as 75 percent for an imitation of a motorboat and 50 percent for an imitation of a gunshot.

Toward more expressive sound technology

Passionate about technology for music and art, Caren envisions that this model could help artists better communicate sounds to computational systems and assist filmmakers and other content creators with generating AI sounds that are more nuanced to a specific context. It could also enable a musician to rapidly search a sound database by imitating a noise that is difficult to describe in, say, a text prompt.

In the meantime, Caren, Chandra, and Ma are looking at the implications of their model in other domains, including the development of language, how infants learn to talk, and even imitation behaviors in birds like parrots and songbirds.

The team still has work to do with the current iteration of their model: It struggles with some consonants, like “z,” which led to inaccurate impressions of some sounds, like bees buzzing. They also can’t yet replicate how humans imitate speech, music, or sounds that are imitated differently across different languages, like a heartbeat.

Stanford University linguistics professor Robert Hawkins says that language is full of onomatopoeia and words that mimic but don’t fully replicate the things they describe, like the “meow” sound that very inexactly approximates the sound that cats make. “The processes that get us from the sound of a real cat to a word like ‘meow’ reveal a lot about the intricate interplay between physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who wasn’t involved in the CSAIL research. “This model presents an exciting step toward formalizing and testing theories of those processes, demonstrating that both physical constraints from the human vocal tract and social pressures from communication are needed to explain the distribution of vocal imitations.”

Caren, Chandra, and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, MIT Department of Electrical Engineering and Computer Science associate professor, and Joshua Tenenbaum, MIT Brain and Cognitive Sciences professor and Center for Brains, Minds, and Machines member. Their work was supported, in part, by the Hertz Foundation and the National Science Foundation. It was presented at SIGGRAPH Asia in early December.