On Voice Tech's Elusive 'iPhone Moment'

The release of Apple's first iPhone in 2007 was a pivotal moment in tech history. The device Apple shipped that summer made people understand just how powerful touch interfaces could be - even though touch screens had already been around for some time. Until then, the user experience had been frustrating at best, and no 'touch screen revolution' was in sight.

Voice tech appears to be at a similar crossroads, at least according to Hannes Heikinheimo of Speechly. In his article, Why Hasn't the iPhone Moment Happened Yet for Voice UIs, Heikinheimo looks at the state of voice UIs today and explains why he feels that streaming Spoken Language Understanding and Reactive Voice User Interfaces - which make voice UIs real-time - may well be the keys to reaching that 'aha' moment for voice UIs, much as the iPhone did for touch screens.

We'll unpack the notions of streaming Spoken Language Understanding and Reactive Multi-modal Voice User Interfaces in a minute. But before we do that, let's look at how voice UIs tend to work today.

 

The Woes of Endpointing

When you have a conversation with someone, odds are you don't always wait for them to finish speaking before you start making sense of their intent. You're also likely to give visual confirmation of your understanding, such as a nod. And as a new idea pops into your head, you might change the subject or ask a question mid-sentence. The human mind is dynamic, and natural language reflects that.

Voice UIs don't tend to work that way. They wait until the user stops talking before they interpret and process any commands. That is called endpointing. And while it works very well for simple tasks, it falls flat for more complex tasks.
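To make the contrast concrete, here's a minimal sketch of an endpointed pipeline in TypeScript. The helpers (detectSilence, transcribeUtterance, parseIntent, executeIntent) are hypothetical placeholders rather than any particular SDK's API; the point is simply that nothing gets interpreted until the user goes quiet.

```typescript
// A hypothetical endpointed pipeline: audio is buffered until silence
// (the "endpoint") is detected, and only then is the utterance interpreted.

type AudioChunk = Float32Array;

interface Intent {
  name: string;                      // e.g. "search_products"
  entities: Record<string, string>;  // e.g. { brand: "Tommy Hilfiger" }
}

// Placeholder signatures only; a real system would back these with a voice
// activity detector, a speech-to-text service, and an NLU model.
declare function detectSilence(chunk: AudioChunk): boolean;
declare function transcribeUtterance(chunks: AudioChunk[]): Promise<string>;
declare function parseIntent(transcript: string): Promise<Intent>;
declare function executeIntent(intent: Intent): void;

const buffered: AudioChunk[] = [];

async function onAudioChunk(chunk: AudioChunk): Promise<void> {
  buffered.push(chunk);
  if (!detectSilence(chunk)) return;   // still talking: nothing is interpreted yet

  // Only after the user stops talking does understanding begin, in one shot.
  const transcript = await transcribeUtterance(buffered.splice(0));
  const intent = await parseIntent(transcript);
  executeIntent(intent);               // if this fails, the whole utterance is lost
}
```

Everything hinges on that single post-silence interpretation: if it goes wrong, the user has to start over from scratch.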

As Heikinheimo says, "Imagine if you need to express something that requires a longer explanation. When looking for a new t-shirt, a person might be tempted to say something like 'I'm interested in t-shirts for men ...in color red, blue or orange, let's say Boss ...no wait, I mean Tommy Hilfiger ...maybe size medium or large ...and something that's on sale and can be shipped by tomorrow ...and I'd like to see the cheapest options first.' When uttering something this long and winding to a traditional Voice Assistant, most likely something will go wrong, resulting in the familiar, 'Sorry, I didn't quite get that.'"

 

Streaming Spoken Language Understanding

With streaming Spoken Language Understanding, the voice system starts interpreting the user's speech as soon as they start talking, so the UI can immediately react to actionable elements as the user speaks them. This enables a Reactive Voice UI that works alongside existing modalities like typing, tapping, or swiping.

One of the benefits of this approach is instant feedback. If the Reactive Voice UI doesn't understand what the user said, the UI will instantly signal this back to them so they can correct and continue. Conversely, if it understands what the user is saying, it can display the information immediately, making the whole voice experience smoother.
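As a rough illustration, and assuming an invented callback shape rather than any specific vendor's API, a streaming handler might look like the TypeScript sketch below: each partial interpretation either updates the screen or signals a misunderstanding the moment it arrives.

```typescript
// A hypothetical streaming SLU handler: partial interpretations arrive while
// the user is still talking, so the UI can react to each one immediately.

interface PartialResult {
  transcript: string;                 // words recognised so far
  intent?: string;                    // current best guess; may change as speech continues
  entities: Record<string, string>;   // actionable elements spotted so far
  confidence: number;                 // 0..1
}

interface ReactiveUi {
  showResults(filters: Record<string, string>): void; // update the visible results
  hint(message: string): void;                        // instant, non-blocking feedback
}

function onPartialResult(result: PartialResult, ui: ReactiveUi): void {
  if (result.confidence < 0.4) {
    // Signal uncertainty right away so the user can rephrase mid-stream,
    // instead of hearing "Sorry, I didn't quite get that" after the fact.
    ui.hint("Not sure I caught that - could you rephrase?");
    return;
  }
  // Every actionable element refines the view the moment it is spoken:
  // "t-shirts for men" filters the catalogue, "...in red" narrows it further.
  ui.showResults(result.entities);
}
```

Because feedback is continuous rather than delivered as a final verdict, the user can correct course mid-utterance.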

Systems using streaming Spoken Language Understanding fail and recover much more dynamically than those using endpointing. That allows for much more sophisticated and complex tasks to be accomplished. It also allows for commands to be expressed in a much more natural way. As Heikinheimo puts it, users can express themselves in a manner closer to a "stream-of-consciousness" rather than using disjointed commands.

 

Multi-Modality Enabled by Reactive Voice UIs

While traditional voice assistants provide synthetic speech as their primary form of feedback to their users, Reactive and Non-interruptive Multi-modal Voice User Interfaces use haptic, non-linguistic auditory, and visual feedback, along with voice.

An example interaction with this kind of UI would be a person looking to buy a pair of jeans saying, "I'm interested in jeans." The UI would immediately react to that statement by displaying popular jeans available for sale. From there, the user could instantly refine the results by uttering, "do you have Levi's?", which would fine-tune what's displayed on the fly.
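A toy TypeScript sketch of that refinement loop, with a made-up three-item catalogue and entity names chosen purely for illustration:

```typescript
// A hypothetical reactive refinement loop: each utterance merges new entities
// into the current filter set, and the visible results update on the fly.

interface Product { name: string; brand: string; category: string; }

// Made-up catalogue, purely for illustration.
const catalogue: Product[] = [
  { name: "501 Original", brand: "Levi's", category: "jeans" },
  { name: "512 Slim Taper", brand: "Levi's", category: "jeans" },
  { name: "Scanton Slim", brand: "Tommy Hilfiger", category: "jeans" },
];

let activeFilters: Record<string, string> = {};

function render(products: Product[]): void {
  // A real app would update the on-screen product grid here.
  console.log(products.map(p => p.name));
}

function onUtteranceEntities(entities: Record<string, string>): void {
  activeFilters = { ...activeFilters, ...entities };
  const matches = catalogue.filter(p =>
    (!activeFilters.category || p.category === activeFilters.category) &&
    (!activeFilters.brand || p.brand === activeFilters.brand)
  );
  render(matches);
}

onUtteranceEntities({ category: "jeans" });  // "I'm interested in jeans" -> popular jeans appear
onUtteranceEntities({ brand: "Levi's" });    // "do you have Levi's?" -> the view narrows on the fly
```

Each utterance only merges new constraints into the existing filter set, which is what lets the view narrow on the fly rather than restarting the search.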

Multi-modal feedback brings the Reactive Voice UI experience closer to human-to-human conversations. In natural conversations, as Heikinheimo points out, "[...] as one person talks, the listener gives feedback with nods, facial expressions, gestures, and interjections like ‘aha’ and ‘mhm’. Furthermore, if the person listening doesn't understand what is being said, they are likely to start making more or less subtle facial expressions to signal their lack of comprehension. […] The same efficiency is exhibited in the Reactive Multi-modal Voice User Interfaces!"

 

Inching Ever-Closer to That 'iPhone Moment'

The crux of this approach is that voice isn't the only input modality; it's one modality among others (haptic, non-linguistic auditory, and visual) that work together to create a more comprehensive UI. That means the key to producing an 'iPhone moment' in voice tech doesn't lie in voice taken in isolation. It lies in enhancing existing experiences with voice to create dynamic, fluid, multi-faceted experiences that call upon a collection of senses. And our best bet for creating this moment would appear to be streaming Spoken Language Understanding and Reactive Voice UIs.

This new and encompassing approach to voice tech may well be ushering in the dawn of its own 'iPhone moment' - time will tell. But whatever the case, the very fact that so many people will experience voice tech's 'iPhone moment' on an actual iPhone illustrates just how compelling such a moment can be…

**

Hear Hannes Heikinheimo, CTO of Speechly, speak at VOICE 2021. View the schedule and join VOICE 2021 - the only conversation poised to shape the future of conversational AI - taking place both in person in Arlington, Va. and virtually, Dec. 7-8.

VOICE Summit