VOXLET: Pioneering Private Voice-Changing Technology

The world is awash with electronic devices equipped with microphones and capable of recording speech. Add in biometric recognition, emotional screening, and similar audio analysis technology, and you have a context in which it has become very difficult to speak anonymously. This not only threatens individual privacy – particularly in places where censorship or retaliation are a concern – it is also of obvious concern for sensitive intelligence operations where the ability to speak anonymously can mean the difference between life and death.

While many voice obfuscation solutions already exist, thus far none of them works in real time, nor are they secure in settings where guarantees of user privacy and prevention of inference attacks are essential for user safety.

“Today, you can use an AI voice transformer online to record yourself saying something and play it back sounding like a famous person,” said Galois Principal Scientist Taisa Kushner. “But that’s currently at the input, output level. It doesn’t change the sound of your voice as you’re speaking in real-time. More importantly, someone could analyze that output and not only discover that it’s fake, but also be able to backtrack out my original voice from it. … We’re trying to fill that gap – developing a solution for transforming speech in real-time, with formal privacy guarantees.” 

To meet this challenge, Galois, GE Aerospace, the University of Vermont, and North Point Defense are collaborating on the Intelligence Advanced Research Projects Activity (IARPA)’s Anonymous Real-Time Speech (ARTS) program, which seeks to develop new technologies for anonymizing conversational speech to help safeguard individual speakers’ identities. 

The team’s solution, Vocal Obfuscation eXpert Language Encoding Technology (VOXLET), combines learned neural network models with state-of-the-art formal privacy guarantees provided by differential privacy techniques. In addition, VOXLET will use novel compression techniques so the system can run on resource-limited hardware.

Real-Time Results

One of the more significant advances being developed with VOXLET is phoneme-level voice transformation.

Like many voice-transformation technologies, VOXLET uses an encoder-decoder neural network architecture. In this framework, input speech is first encoded into a latent representation of relevant features (pitch, tempo, etc.). Next, the encoded features are modified to make the voice sound different while preserving the core content – the actual words being said. Finally, the decoder takes the modified latent features and generates a new, different-sounding output speech. The core problem with current voice transformation technologies using this framework is that they can only reliably transform complete words, rather than the individual sounds that make up words.
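The three stages described above can be pictured with a toy sketch. Everything in it is an illustrative assumption: the Latent fields, the function names, and the fixed pitch and tempo offsets are hypothetical, and a real system would use learned neural encoders and decoders operating on audio rather than hand-written functions on a dictionary.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Latent:
    """Toy latent representation (hypothetical fields, not VOXLET's)."""
    pitch_hz: float   # speaker-dependent feature
    tempo: float      # speaking rate, relative to 1.0
    content: tuple    # the words being said


def encode(utterance: dict) -> Latent:
    # Stand-in for a learned encoder: pull out the relevant features.
    return Latent(utterance["pitch_hz"], utterance["tempo"],
                  tuple(utterance["words"]))


def modify(z: Latent, pitch_shift: float, tempo_scale: float) -> Latent:
    # Change how the voice sounds; leave the content untouched.
    return replace(z, pitch_hz=z.pitch_hz + pitch_shift,
                   tempo=z.tempo * tempo_scale)


def decode(z: Latent) -> dict:
    # Stand-in for a learned decoder: rebuild an utterance from the latent.
    return {"pitch_hz": z.pitch_hz, "tempo": z.tempo, "words": list(z.content)}


def transform(utterance: dict) -> dict:
    # Encode, modify, decode: the three steps in order.
    return decode(modify(encode(utterance), pitch_shift=-40.0, tempo_scale=1.1))


original = {"pitch_hz": 210.0, "tempo": 1.0, "words": ["trying", "to"]}
converted = transform(original)
```

In this sketch the converted utterance has a different pitch and tempo, while the words themselves pass through unchanged.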

“That’s a massive blocker for real-time systems,” said Kushner. “It’s just not how people actually talk. If I’m talking too slow or too fast, or if I’m in a conversation and I start saying one word, but stop myself halfway through when I decide I want to say a different word, it won’t work.”

Instead, VOXLET works with phonemes – separating out, analyzing, and modifying the smallest possible phonetic unit rather than full words. This is a much finer-grained level of detail for voice transformation.

For example, the spoken phrase: “Hi, um, trying to” would be processed as three transformable words by current deepfake technology. By contrast, VOXLET’s Phoneme Pipeline extracts 16 phonemes, including disfluencies (like “um”) and silences.
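A toy sketch shows why the unit of processing matters for streaming. The phoneme symbols below are rough ARPAbet-style labels chosen for illustration, "|" marks a word boundary, and both functions are hypothetical; the sketch does not reproduce VOXLET's pipeline or its 16-phoneme segmentation of the phrase above. It only shows that a word-level processor must wait for a boundary before emitting anything, while a phoneme-level processor can emit each unit as it arrives, including the units of a word the speaker abandons halfway.

```python
def word_level(stream):
    """Emit a unit only once a full word has been heard."""
    buffer, out = [], []
    for token in stream:
        if token == "|":              # word boundary reached
            if buffer:
                out.append("-".join(buffer))
            buffer = []
        else:
            buffer.append(token)
    return out, buffer                # buffer = sounds still waiting for a boundary


def phoneme_level(stream):
    """Emit every phoneme as soon as it arrives."""
    return [token for token in stream if token != "|"]


# "Hi, um, try-": the speaker stops partway through the last word.
stream = ["HH", "AY", "|", "AH", "M", "|", "T", "R", "AY"]

words, pending = word_level(stream)
phonemes = phoneme_level(stream)
```

Here words holds two completed units and pending still holds the three phonemes of the unfinished word, whereas phonemes contains all seven units.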

The result is a technology that optimizes the latent space at the heart of speech transformation, working at a much finer grain of detail and yielding much more believable results in real time.[1]

Formal Privacy Guarantees

The next challenge is how to make VOXLET secure – such that the output not only sounds natural, but can’t be identified as a fake or reverse engineered to reproduce the speaker’s actual voice. For this, Galois is using a privacy technique called differential privacy. Differential privacy, most often used to protect sensitive statistical data, introduces noise to protect individuals’ private information while ensuring that aggregate statistical results remain significant and useful. With VOXLET, Galois is pioneering a new application for the technique, adding randomized noise to speech data to make individual speaker identification impossible.
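For readers unfamiliar with the technique, the classic Laplace mechanism gives a feel for how differential privacy works on statistical data. This is the textbook mechanism, not VOXLET's method: the function names, the example ages, and the choice of epsilon are illustrative assumptions, and the sketch assumes the dataset size is public. Noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to a query result, so a smaller epsilon means more noise and stronger privacy.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws is Laplace-distributed
    # with mean 0 and the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng):
    """Release a noisy mean of `values` via the Laplace mechanism."""
    # Clip so that a single record has bounded influence on the result.
    clipped = [min(max(v, lower), upper) for v in values]
    # Replacing one record moves the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38, 44, 31]   # made-up example data
noisy = private_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng)
```

Any single released value may be far from the true mean, but averaged over many releases the noise cancels out, which is why aggregate statistics stay useful while individual records are masked.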

The challenge here, and the focus of the team’s current research, is figuring out exactly how much noise to add and where to add it.

“We want to be able to divide the latent space into static and dynamic features,” Kushner explained. “Static features may be markers of speech, like average frequency, that are constant for an individual speaker; whereas dynamic features might be lexical content, [the actual words and phrases spoken,] or even background noise.”

By identifying and dividing key features in the latent space, VOXLET can introduce targeted noise to static features to disguise the speaker’s identity. This is a fine balancing act between safety and believability: Add too much noise, or add noise to the wrong feature in the latent space, and the speaker’s identity will be disguised, but the output will be “too noisy,” making it obvious that the voice has been changed. Add too little noise and the output audio could still be reverse engineered to uncover the speaker’s true identity.
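The idea of perturbing only the static part of the latent representation can be sketched as follows. The feature names, the static/dynamic split, and the Gaussian noise with a hand-picked scale are all hypothetical; choosing the real split and calibrating the noise so that it carries a formal privacy guarantee is exactly the open research question described above, and this sketch makes no such guarantee.

```python
import random


def perturb_static(latent: dict, static_keys: set, noise_scale: float,
                   rng: random.Random) -> dict:
    """Add noise to static (speaker-identifying) features only."""
    out = {}
    for name, value in latent.items():
        if name in static_keys:
            out[name] = value + rng.gauss(0.0, noise_scale)
        else:
            out[name] = value         # dynamic features pass through unchanged
    return out


latent = {
    "avg_frequency_hz": 210.0,        # static: roughly constant per speaker
    "lexical_content": "trying to",   # dynamic: what is being said
    "background_noise_db": 32.0,      # dynamic: depends on the environment
}
static_keys = {"avg_frequency_hz"}

disguised = perturb_static(latent, static_keys, noise_scale=15.0,
                           rng=random.Random(1))
```

In this toy version the noise_scale parameter stands in for the balancing act: a larger value disguises the static feature more thoroughly but pushes the output further from a natural-sounding voice.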

“We’re figuring out a big optimization problem,” said Kushner. “What’s the best input? What’s the best latent space? How do we add noise to achieve our goals? That raw signal can be sorted into a bunch of moving parts, and we’re trying to work with those parts to get a final output that is synthesized speech that both sounds good and is not invertible.”

The goal is surgically precise noise insertion – just the right amount, in just the right place.

While the VOXLET team’s work is ongoing, the results are quite exciting even now. Already, the team has developed and presented a working prototype. Now, that prototype is being honed and improved. While the current phase is focusing on English language transformation, the next will add strategic multilingual capabilities – a feature made much easier with VOXLET’s phoneme approach.

If all goes well, the impact of VOXLET could be significant: not only one of the world’s first real-time voice changers, but one that uses a novel approach to yield better-quality results and protect the privacy of its users – advancing national security and likely quite literally saving lives.

[1] This also opens up enormous possibilities for multilingual voice transformation, use of proper names, etc.