Modev Blog

Behind the Synthetic Voice

AI-powered voice tech is pretty much everywhere these days. The AI industry keeps on booming: smart assistants now let users order food through their smart speakers, and that same voice tech can take their customer service calls. And while a lot of time and money has been - and is being - spent on the crucial mission of making voice experiences more naturally interactive, there's another facet of AI-powered voice tech that doesn't revolve as much around interactivity: AI-generated voice, also referred to as synthetic speech or "audio deepfakes." And it's having a significant impact on the film, television, and gaming industries.

What is AI voice generation?

The definition of AI voice generation is right there in the name: it's voice generated through AI. The voices we interact with when using a smart assistant are AI-generated. But that smart speaker's voice isn't as natural as, say, your spouse's voice - it's not as flowing, and there's just something slightly "robotic" about the way it speaks.

But recent advancements in deep learning have enabled the generation of eerily realistic and human-sounding voices, which are now being used in adverts, TV shows, films, and video games. And you'd be hard-pressed to distinguish the AI from the real thing.

Here's the official trailer for Roadrunner: A Film About Anthony Bourdain, the recent documentary on the life of the late Anthony Bourdain. The film's director used Bourdain's previously recorded speech to generate a few new lines that Bourdain never uttered. Some of those lines are in the trailer. Can you figure out where? Don't feel bad - I can't either…

How it works

It's pretty impressive how far speech generation has come since the days of scam robocalls (actually, I still get a few robocalls here and there…). You can hear these AI-generated voices breathe in a very natural way. They also pause where it makes sense. These voices have a human tone and convey emotion. So how did we get here?

Going back to our somewhat robotic voice assistants: they sound robotic because they essentially stitch together pre-recorded words, which don't always sound natural when lined up with one another. Deep learning changed all that, along with the requirements for achieving a natural-sounding synthetic voice. Developers no longer need to hand-code the voice's pronunciation, pacing, or intonation. They just feed a few hours of recorded speech to the algorithm, which learns all of those subtle patterns on its own.
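To see why that stitching approach sounds robotic, here's a toy sketch of concatenative synthesis. Everything in it is hypothetical (the "recordings" are just short lists of made-up sample values); real systems join actual audio units and try to smooth the seams, but the abrupt joins are exactly what gives older assistants their choppy sound.

```python
# Toy concatenative synthesis: pre-recorded word "clips" (hypothetical
# lists of sample values standing in for audio) are stitched together
# in sentence order, with no smoothing at the boundaries.

recordings = {
    "turn":  [0.1, 0.3, 0.2],
    "on":    [0.4, 0.5],
    "the":   [0.2, 0.1],
    "light": [0.6, 0.4, 0.3],
}

def stitch(sentence):
    """Concatenate the pre-recorded clip for each word in the sentence."""
    samples = []
    for word in sentence.split():
        samples.extend(recordings[word])  # abrupt joins -> "robotic" sound
    return samples

audio = stitch("turn on the light")
print(len(audio))  # 10 samples total
```

Each word sounds fine on its own, but nothing adjusts pitch or timing across word boundaries - which is the gap deep learning closes.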

Deep learning-based voice generation is a sophisticated process that typically combines two or more models, each with its particular focus, to achieve impressively natural results. A first model broadly predicts what the speaker will sound like (things like pitch, timbre, and accent). Its output is then refined by a second model that's more concerned with subtleties such as breathing, timing, and environment (is the speaker in a bathroom or a church? The way the voice resonates in its surroundings is taken into account too). And there could be a third and even a fourth model - limited only by your computing power.
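The two-stage idea above can be sketched in a few lines of code. This is purely illustrative - the function names, features, and "smoothing" below are stand-ins I've invented, not a real text-to-speech system (real pipelines typically predict spectrogram frames and pass them to a neural vocoder):

```python
# Hypothetical two-stage pipeline: stage 1 maps text to coarse acoustic
# features, stage 2 refines them. All numbers here are toy stand-ins.

def acoustic_model(text):
    """Stage 1: predict one coarse feature per character (a stand-in
    for pitch/timbre predictions)."""
    return [ord(ch) % 10 / 10.0 for ch in text]

def refinement_model(features):
    """Stage 2: smooth the coarse features (a stand-in for a second
    model adding breathing, timing, and room acoustics)."""
    smoothed = []
    for i, f in enumerate(features):
        prev = features[i - 1] if i > 0 else f
        smoothed.append((prev + f) / 2)  # simple moving average
    return smoothed

coarse = acoustic_model("hi")
audio = refinement_model(coarse)
```

The point is the division of labor: the first stage decides roughly what to say and how it should sound, and each later stage layers on finer detail - which is why adding a third or fourth model is mainly a question of compute.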

Machine-made or hand-crafted?

The advertising industry was one of the first to embrace AI-generated speech - and we understand why. Brands need to produce ads in different languages in order to cover all of the localities where their products are sold. Traditionally, that would require an English voice actor, a Spanish voice actor, a French voice actor, and so on. With AI-generated voice, all that's needed is to switch the language or tweak the accent. The same voice can be used across multiple languages - a boon to brands, which benefit from the consistency.

It also opens the door to adapting a message to its target audience. Brands can tailor an ad to a particular podcast's listeners, for example, or run one of many versions of the same ad based on who's listening. The same ad could even direct listeners to different shops depending on where it airs.

Many gaming studios have also jumped on the AI-generated voice bandwagon, using synthetic voices to make their characters realistically cry, laugh, and whisper. And the same process is used in film and television, particularly in, though not limited to, animated films and TV shows.

Where does that leave voice actors?

While long-term predictions are inevitably harder to make than short-term ones, voice actors aren't going anywhere in the short term. Though the tech is pretty awe-inspiring, it still has some limitations. For example, synthetic voices lose their realism with long stretches of speech. That makes them less than ideal for things like audiobooks or podcasts. Plus, the guidance a voice actor can get from their director is lost with AI voices.

Also, voice actors are still a critical part of the equation for AI-generated speech. Any synthetic voice you want to create is going to need to be trained by… an actual human voice. So the nature of the work might change, but for now, we still can't do it without actual human voices.

Today, voice actors' main concern isn't the fear of being consigned to oblivion. They're more worried about losing control of their voices and not being fairly compensated. After all, less work means less money. And if advertisers use that initial work to generate multiple "spin-offs" that profit the company but for which the voice actor is never paid, it raises some ethical questions.

What if the advertising company wants the synthetic voice to say something that the original voice actor isn't comfortable with, something they would have refused to speak? What happens then? It's their voice. And for a voice actor, their voice is their brand and reputation.

Wrap up

New technology is always going to be somewhat disruptive. Humanity tends to make its mistakes in the short term and fix them in the long term. So we're never problem-free, but over time we iron out our issues - making room for brand new ones, I guess. That's us, and that's voice tech too. I think that through cooperation and open discussion (as well as trial and error), we can eventually strike a balance between the benefits and the costs of AI speech generation. At Modev, we look forward to being part of that discussion as we continue to work with the world's leading voice technologists and shine a light on all aspects of the coolest and most innovative voice tech.

About Modev

Modev was founded in 2008 on the simple belief that human connection is vital in the era of digital transformation. Today, Modev produces market-leading events such as VOICE Global, presented by Google Assistant, VOICE Summit, and the award-winning VOICE Talks internet talk show. Modev staff, better known as "Modevators," include community building and transformation experts worldwide. To learn more about Modev, and the breadth of events offered live and virtually, visit
