
The key to conversational speech recognition

By Jelani Harper

Advancements in statistical AI applications for understanding and generating text have been nothing short of staggering over the past few years. Many believe it’s only a matter of time before audio manifestations of natural language, including speech recognition and the more recently emergent field of voice AI, follow suit.

Based on some of the recent developments in this form of advanced machine learning, that time may be closer at hand than most people realize.

Deepgram recently released a cognitive computing model specifically designed to account for interruptions, pauses, and turn-taking in spoken conversations, all facets of speech recognition that have traditionally bedeviled spoken interactions between humans and machines. Because the model, Flux, was trained to accommodate these subtleties of spoken dialogues, it provides more natural conversations between humans and voice AI agents than other approaches do.

The underlying architectural methodology that powers the model, which is predicated on deep neural networks, is a significant contributor to its facility for improving speech recognition. Pairing these attributes with some of the basic principles of streaming data endows the model with a degree of real-time contextualization that even models based on micro-batch paradigms can’t match.

In addition to making speech recognition systems more sophisticated, the gains from this combination are a timely reminder that avant-garde deployments of statistical AI do not solely involve foundation models, frontier models, and language models.

Deep Neural Networks

Although the basis of most language models is firmly rooted in neural networks and the transformer architecture, many people consider the very term deep neural networks a dated reference to cutting-edge cognitive computing deployments from five to 10 years ago. However, the core principles of this type of advanced machine learning are not only indirectly responsible for contemporary language models, but also remain effective without the billions of parameters that LLMs have. By coupling timeless machine learning approaches, like supervised learning and semi-supervised learning, with the computational efficacy of deep learning deployments, Deepgram’s speech recognition model was trained on countless hours of conversations.
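
The article doesn’t detail Deepgram’s training pipeline, but the pairing of supervised and semi-supervised learning often takes the form of self-training: a model fit on transcribed audio pseudo-labels untranscribed audio, and the confident pseudo-labels are folded back into the training set. The toy sketch below illustrates that loop with a stand-in one-dimensional classifier; every function, value, and threshold here is an illustrative assumption, not anything Deepgram has described.

```python
# Toy sketch of self-training (supervised + semi-supervised learning).
# The "classifier" is a nearest-centroid stand-in; all values are made up.
def centroid_fit(labeled):
    """Fit one centroid per class from (feature, label) pairs."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence) for the nearest centroid; confidence is
    the margin between the two nearest classes."""
    dists = sorted((abs(x - c), y) for y, c in centroids.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 1.0
    return dists[0][1], margin

labeled = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b")]  # transcribed audio
unlabeled = [0.15, 0.55, 0.95]                              # raw audio features

centroids = centroid_fit(labeled)           # supervised step
for x in unlabeled:                         # semi-supervised step
    label, confidence = predict(centroids, x)
    if confidence > 0.3:                    # keep only confident pseudo-labels
        labeled.append((x, label))
centroids = centroid_fit(labeled)           # retrain on the enlarged set
print(centroids)
```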

According to Natalie Rutgers, VP of Product at Deepgram, “We built an entirely different architecture for this. Our speech recognition models are built on end-to-end deep learning. They are transformer-based models with encoders and decoders. There’s specifically RNNs and CNNs in our models.” Incorporating aspects of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) enables contemporary speech recognition models to provide capabilities like context-aware turn detection to understand the appropriate time to respond to people.
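
Deepgram hasn’t published Flux’s internals, but a minimal PyTorch sketch of the general shape Rutgers describes, an encoder-decoder transformer with convolutional and recurrent components, could look like the following. All layer names, sizes, and the overall structure are assumptions for illustration, not the production architecture.

```python
# Illustrative sketch only: an end-to-end speech model combining CNN, RNN,
# and transformer components, as described in general terms in the article.
import torch
import torch.nn as nn

class SpeechEncoderDecoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab_size=1000):
        super().__init__()
        # CNN front end: downsamples the spectrogram in time and extracts
        # local acoustic features.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # RNN layer: carries context forward frame by frame, which suits
        # left-to-right streaming processing.
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        # Transformer encoder: models longer-range dependencies in the audio.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Transformer decoder: emits text tokens conditioned on the audio.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, mels, tokens):
        # mels: (batch, n_mels, frames); tokens: (batch, seq)
        x = self.conv(mels).transpose(1, 2)   # -> (batch, frames', d_model)
        x, _ = self.rnn(x)
        memory = self.encoder(x)
        y = self.decoder(self.embed(tokens), memory)
        return self.out(y)                    # token logits

model = SpeechEncoderDecoder()
logits = model(torch.randn(1, 80, 400), torch.zeros(1, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 10, 1000])
```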

Traditionally, “for machines, the only way they do this is essentially look for silences in conversations,” Rutgers explained. Other techniques involve machines analyzing text, as opposed to analyzing real-time spoken inputs. Because of the extensive training modern speech recognition models undergo based on actual conversations between people, they can comprehend turn detection for real-world use cases like “repeating an alphanumeric string back or, say, a United miles account ID; you don’t know how many digits to expect,” Rutgers pointed out. “Or, someone thinking about what sort of toppings they want on a pizza, and they’re ‘umming’ and ‘aahing’.”
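
To see why silence-based heuristics break down on inputs like account numbers, consider the toy endpointer below, which declares end-of-turn after a fixed run of silent frames. The thresholds and frame lengths are made-up values for illustration, not any production system’s settings.

```python
# Toy illustration of the baseline Rutgers describes: declaring "end of
# turn" after a fixed stretch of silence. All thresholds are made up.
def silence_based_end_of_turn(frame_energies, silence_threshold=0.01,
                              silence_frames_needed=30):
    """Return the frame index where a naive endpointer would cut the turn
    off, or None if it never fires. If frames are 20 ms each, 30 silent
    frames is roughly 600 ms of silence."""
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        silent_run = silent_run + 1 if energy < silence_threshold else 0
        if silent_run >= silence_frames_needed:
            return i
    return None

# A caller reading out an account number, pausing between digit groups:
# "1 2 3 4 ... (pause) ... 5 6 7 8". The pause alone trips the endpointer,
# even though the speaker clearly isn't finished.
speech, pause = [0.5] * 40, [0.0] * 35
frames = speech + pause + speech
print(silence_based_end_of_turn(frames))  # fires during the mid-utterance pause
```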

Streaming-first architecture

These and other advantages of modern speech recognition systems are partly attributable to the prioritization of a streaming data approach to analyzing conversations. By relying on what Rutgers termed a streaming-first architecture, contemporary models in this space have greater contextualization capabilities. As it turns out, the capacity for models to process audio inputs in real time for conversational interactions may be something of an anomaly.

According to Rutgers, “Most of the time, these models are built so that you can take a recorded conversation, like a Zoom recording after a call, send that through, and the model will give you back a transcript. The way that people have made streaming versions of that work so far is they’ll chunk audio as it’s coming in every two to three seconds.” In addition to causing latency, this micro-batch paradigm can limit the context within these temporal windows. The shortfalls of this method are readily discernible. “It’s not cohesive; you’re not building that context over time,” Rutgers said. “There’s a lot of limitations that come from that in terms of the accuracy, the contextual understanding.”
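
A rough sketch of the difference: the chunked pattern buffers a few seconds of audio and processes each window in isolation, while a streaming-first pattern threads state forward frame by frame. The “recognizers” below are stand-ins that merely track frames, not real models, and all numbers are illustrative.

```python
# Contrast sketch of the two patterns Rutgers describes.
def chunked_transcribe(audio_frames, frame_ms=20, chunk_s=2.0):
    """Micro-batch style: buffer ~2 s of audio, then process each chunk in
    isolation. Nothing is emitted until a full chunk accumulates (latency),
    and context resets at every chunk boundary."""
    frames_per_chunk = int(chunk_s * 1000 / frame_ms)
    for start in range(0, len(audio_frames), frames_per_chunk):
        chunk = audio_frames[start:start + frames_per_chunk]
        yield f"transcript of frames {start}-{start + len(chunk) - 1}, no prior context"

def streaming_transcribe(audio_frames):
    """Streaming-first style: process each frame as it arrives, threading
    state forward so the hypothesis builds context over the whole call."""
    state = []
    for i, frame in enumerate(audio_frames):
        state.append(frame)  # a real model would update hidden state here
        yield f"hypothesis at frame {i}, conditioned on all {len(state)} frames so far"

frames = list(range(250))  # ~5 s of audio in 20 ms frames
print(list(chunked_transcribe(frames))[0])  # first output only after 2 s of audio
print(next(streaming_transcribe(frames)))   # first output after a single frame
```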

By eschewing such limitations, vanguard speech recognition models furnish, in real time, native functionality for understanding and reacting accordingly to interruptions in conversation. With this characteristic, speech recognition systems can effectively listen to and understand humans, even while the model itself is generating a spoken response. Without it, speech recognition systems underpin voice AI applications in which “voice agents don’t know when to pause,” Rutgers remarked. “They’re not good at knowing that someone has butted in and is trying to interject. That’s why if you’ve ever yelled ‘agent, agent, agent!’ at an IVR system and not had any luck with that, it’s because they don’t have that ability.”
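
Barge-in handling of the sort Rutgers describes can be sketched as two concurrent tasks: one playing the agent’s reply, one listening for caller speech, with playback cancelled the instant the listener fires. The asyncio sketch below uses stubbed audio functions; a real system would wire these to a live recognizer and a TTS stream.

```python
# Hedged sketch of "barge-in": the agent keeps listening while it speaks
# and cancels its own playback when the caller interjects. The speech
# detector and TTS player here are stubs, not real audio components.
import asyncio

async def speak(text):
    """Stand-in for TTS playback; cancellable mid-utterance."""
    for word in text.split():
        print(f"agent: {word}")
        await asyncio.sleep(0.2)

async def detect_user_speech():
    """Stub: pretend the caller starts talking after ~0.5 s."""
    await asyncio.sleep(0.5)
    return "agent, agent, agent!"

async def respond_with_barge_in(reply_text):
    playback = asyncio.create_task(speak(reply_text))
    listener = asyncio.create_task(detect_user_speech())
    done, _ = await asyncio.wait({playback, listener},
                                 return_when=asyncio.FIRST_COMPLETED)
    if listener in done:           # caller interrupted: stop talking, hand
        playback.cancel()          # the floor back, and process the input
        print(f"caller barged in with: {listener.result()!r}")
    else:                          # agent finished uninterrupted
        listener.cancel()

asyncio.run(respond_with_barge_in("Your account balance is forty two dollars"))
```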

A long way

The capabilities of today’s speech recognition models are certainly a long way from those of Interactive Voice Response (IVR) systems, which facilitate simple, template-based responses to a limited selection of inputs. They’re also an affirmation of the continued prowess of deep neural networks at a scale that’s nowhere near that employed for LLMs. Moreover, they help to close the proverbial gap between applications of speech recognition, or voice AI, and those of its more celebrated textual deployments.
