How Large Language Models Will Create Tomorrow's Voice Experiences

Modev News VOICE Summit 22-11-3 Modev Staff Writers 5 min read

AI-driven voice tech is not a fringe novelty - it's very much mainstream. And it keeps growing in adoption and sophistication. Many - if not most - of us have directly experienced this growth, having moved from voice dialing to voice search. Voice tech is ubiquitous today, and it's likely here to stay.

That also means we can expect further innovation and increasingly engaging experiences moving forward. One of the drivers of those experiences is Large Language Models (LLMs). LLMs pave the way to more intuitive, complex, and linguistically subtle experiences. That last point is key here.

What I mean by "linguistic subtlety" is an AI assistant's ability to pick up on subtle points of language, like humor, for example. To make these new experiences a reality, we're going to need interfaces with bigger and better Natural Language Processing (NLP) skills. For example, how is a machine supposed to get a pun? How can a machine detect irony, sarcasm, hesitation, or confidence? In other words, how can artificial intelligence "know" when to interpret a statement literally and when to interpret a statement figuratively?

A big part of the answer lies in Large Language Models.

In this post, we're going to look at what Large Language Models are and how they work. We'll then provide examples by drawing upon OpenAI's GPT-3 and Google's PaLM to illustrate just how powerful LLMs can be.

What Are Large Language Models?

In simple terms, a language model's principal task is to predict which words come next in a statement. So imagine the statement starts with "We should try and meet to discuss this, how about…" - a good language model would predict something along the lines of "tomorrow" or "next week" rather than "last year" or "John." There are many different language models, and we've been using them for a long time. Most smaller language models aren't very sensitive to tone or intent; they simply regurgitate the speech patterns on which they've been trained.
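To make the idea of "predicting what comes next" concrete, here is a minimal sketch of a toy language model that simply counts which word tends to follow which. It is an illustration only: the training sentences are made up, the function names are our own, and real LLMs use neural networks rather than raw counts.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for every word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            following[current_word][next_word] += 1
    return following


def predict_next(model, prompt, k=2):
    """Return up to k of the most frequent continuations of the prompt's last word."""
    last_word = prompt.lower().split()[-1]
    return [word for word, _ in model[last_word].most_common(k)]


# Made-up training sentences, for illustration only.
corpus = [
    "how about tomorrow",
    "how about tomorrow morning",
    "what about tomorrow",
    "how about next week",
    "what about next month",
    "we talked about john",
]

model = train_bigram_model(corpus)
suggestions = predict_next(model, "We should try and meet to discuss this, how about")
print(suggestions)  # ['tomorrow', 'next']
```

Because "tomorrow" and "next" follow "about" more often than "john" does in this tiny corpus, they are the model's top guesses. The model only ever looks at one previous word, which is exactly the kind of pattern regurgitation that larger models move beyond.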

Large Language Models are just that: language models that are larger - a lot larger. And the word "large" in Large Language Models refers to two things:

The amount of data they're trained on
The number of parameters in the model itself

Let's turn to some impressive examples of LLMs to illustrate the above.

OpenAI's GPT-3 - The First Giant Leap

When OpenAI launched its GPT-3 Large Language Model, it was a game-changer. GPT-3, the third iteration of OpenAI's LLM, was trained on a dataset of roughly 500 billion tokens (words or parts of words), and the model contains 175 billion parameters. That "largeness" makes LLMs much more capable of detecting subtleties in language, tone, intent, etc. It enables GPT-3 to output text so close to what a human would write that, for your average person, it's often hard to tell apart from the real thing.
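To get a feel for what those numbers mean, here is a quick back-of-the-envelope calculation. The token and parameter counts are the ones quoted above; the assumption that each parameter is stored as a 16-bit (2-byte) number is ours, made purely for illustration.

```python
# Figures quoted in the article for GPT-3.
PARAMETERS = 175_000_000_000
TRAINING_TOKENS = 500_000_000_000

# Assumption (not from the article): each parameter is stored as a 16-bit float.
BYTES_PER_PARAMETER = 2


def weights_size_gb(parameters, bytes_per_parameter):
    """Storage needed for the raw weights, in gigabytes (1 GB = 10**9 bytes)."""
    return parameters * bytes_per_parameter / 10**9


def tokens_per_parameter(tokens, parameters):
    """How many training tokens there are for each model parameter."""
    return tokens / parameters


size_gb = weights_size_gb(PARAMETERS, BYTES_PER_PARAMETER)
ratio = round(tokens_per_parameter(TRAINING_TOKENS, PARAMETERS), 2)
print(size_gb)  # 350.0
print(ratio)    # 2.86
```

Under that storage assumption, the weights alone would take up about 350 GB, far more than fits on a typical consumer device. That is one reason models of this size are usually reached through a hosted API rather than run locally.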

Some people have even used OpenAI's GPT-3 API to create chatbot versions of deceased loved ones, with results so realistic and compelling that one can't help but contemplate metaphysical and existential questions about consciousness. That speaks to how natural, coherent, and real LLMs can feel.

Google's PaLM - Raising The Bar Even Higher

In April 2022, Google upped the ante when it unveiled its own LLM, PaLM. PaLM stands for Pathways Language Model, and Google credits it with breakthrough language understanding and reasoning capabilities. This latter point is critical to PaLM.

What makes PaLM so impressive is its ability to reason. PaLM goes beyond simply regurgitating patterns of text it was trained on. It's capable of conceptual understanding, has a grasp of causation (cause and effect), and can even understand jokes. The language model's sophistication makes it capable of surprisingly subtle interpretations. Here are a few examples taken from Google's AI blog.

Cause & Effect

Prompt:

Which of the following sentences makes more sense?

I studied hard because I got an A on the test.
I got an A on the test because I studied hard.

PaLM's Response:

I got an A on the test because I studied hard.
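For readers wondering what a prompt like this looks like in practice, here is a minimal sketch of how a question and its candidate answers might be assembled into a single block of text before being handed to a language model. The function name and the numbered layout are hypothetical choices of ours, not the format Google used, and no model is actually called.

```python
def build_multiple_choice_prompt(question, options):
    """Lay out a question and its candidate answers as one plain-text prompt."""
    lines = [question, ""]
    for number, option in enumerate(options, start=1):
        lines.append(f"{number}. {option}")
    lines.append("")
    lines.append("Answer:")  # The model would be asked to continue from here.
    return "\n".join(lines)


prompt = build_multiple_choice_prompt(
    "Which of the following sentences makes more sense?",
    [
        "I studied hard because I got an A on the test.",
        "I got an A on the test because I studied hard.",
    ],
)
print(prompt)
```

The point is simply that the model receives ordinary text and is asked to continue it. Everything PaLM "knows" about cause and effect has to show up in which continuation it finds most likely.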

Conceptual Understanding

Prompt:

Concept: moving definition. Question: Which of the following sentences best characterizes moving definitions?

Moving definitions can be very fast.
Moving definitions define movements.
Moving definitions can be expensive.
Moving definitions change often.

PaLM's Response:

Moving definitions change often.

Humor

PaLM pushes the language model envelope even further to the point of "getting" jokes.

Prompt:

Explain this joke:

Joke: Did you see that Google just hired an eloquent whale for their TPU team? It showed them how to communicate between two different pods!

PaLM's Response:

Prediction: TPUs are a type of computer chip that Google uses for deep learning. A "pod" is a group of TPUs. A "pod" is also a group of whales. The joke is that the whale is able to communicate between two groups of whales, but the speaker is pretending that the whale is able to communicate between two groups of TPUs.

While that may be somewhat clinical (and aside from the fact that jokes that are explained are rarely funny), the above is an example of a machine "getting" a joke. That's huge. And the implications it has for AI speech interpretation are hard to overstate.

Wrap Up

So those are LLMs in general, and GPT-3 and PaLM in particular. The crazy thing is that records are meant to be broken. So as impressive as these capabilities are today, we're likely to be on another level altogether in just a few years. The experiences these advancements could produce are so varied that they're hard to pin down. However, one thing's for sure: they're bound to be as compelling as they are impressive.

The future is vocal (and LLMs are a big part of that future).

Modev

Modev was founded in 2008 on the simple belief that human connection is vital in the era of digital transformation. Modev believes markets are made. From mobile to voice, Modev has helped develop ecosystems for new waves of technology. Today, Modev produces market-leading events such as VOICE Global, presented by Google Assistant; VOICE Summit, the most important voice-tech conference globally; and the Webby award-winning VOICE Talks internet talk show. Modev staff, better known as "Modevators," include community building and transformation experts worldwide. To learn more about Modev, and the breadth of events and ecosystem services offered live, virtually, locally, and nationally - visit modev.com.

Modev Staff Writers

Modev staff includes a talented group of developers and writers focused on the industry and trends. We include Staff when several contributors join forces to produce an article.
