If you ring a call centre, the first voice you hear often isn’t a real person but a digitised one, with which you “speak” so that voice recognition can process your request.
In its most basic form, computerised voice recognition uses simple pattern matching, while more complex systems use detailed mathematical and predictive models.
To find out how they work, we will first look at why voice recognition is so difficult and then explore three main methods found in everyday products.
Understanding spoken words is something that we take for granted, but it’s a highly complex process that we learn during childhood.
Packets of sound, called ‘phones’, are the building blocks of words. These are stored in our brain as ‘phonemes’, the ideal form of the phone. When chatting to someone, you hear phones, match them to the phonemes you know and assemble those blocks into words that convey a message.
While it sounds straightforward, there are a few issues when a computer tries to do the same thing. Homophones, words that sound identical but mean different things, can confuse a voice recogniser unless the words are analysed in context (which we will get to later).
Some people also speak very quickly, melding an entire sentence into a single unbroken sound that – for a computer – can be hard to separate into words.
In loud environments, it can also be hard to separate the sound of a voice from background noise, particularly if the ambient noise comes from people talking.
On top of all of this are dialect variations that change how the same word is said between different populations.
Taking this all into account, let’s now look at how voice recognition overcomes these issues.
Simple pattern matching
As far as voice recognition goes, this is as simple as it gets – it relies on a computer listening to a word and matching its audio pattern to a preloaded phrase.
It’s the type of recognition used by automated call centres, where simple ‘yes’ and ‘no’ or ‘one’, ‘two’, ‘three’ responses are enough to direct the caller. This small group of around 10 words, known as a ‘domain’, allows the software to recognise a broad range of dialects.
The system only works for words that sound completely different. Even then, it may have trouble, forcing the call to be directed to a human operator.
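As a rough illustration of the idea, the sketch below matches an incoming pattern against a tiny domain of stored words and gives up when nothing is close enough. The numbers standing in for audio patterns, the `recognise` function and the `max_distance` threshold are all invented for this example; a real system would work on actual audio features.

```python
import math

# Hypothetical stored patterns: each word in the domain is reduced to a
# short list of made-up numbers standing in for its audio pattern.
TEMPLATES = {
    "yes": [0.9, 0.1, 0.4],
    "no": [0.2, 0.8, 0.5],
    "one": [0.5, 0.5, 0.9],
}


def distance(a, b):
    """Euclidean distance between two equal-length patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognise(pattern, max_distance=0.3):
    """Return the closest word in the domain, or None if nothing is close."""
    best_word = min(TEMPLATES, key=lambda w: distance(pattern, TEMPLATES[w]))
    if distance(pattern, TEMPLATES[best_word]) > max_distance:
        return None  # no good match: hand the call to a human operator
    return best_word


print(recognise([0.85, 0.15, 0.45]))  # close to the stored "yes" pattern
print(recognise([0.5, 0.5, 0.1]))     # not close to any stored pattern
```

The first call returns "yes" and the second returns None. Keeping the domain small is what makes this workable: the fewer and more distinct the stored patterns, the more variation in a caller's voice the matcher can tolerate.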
Pattern and feature analysis
This type of recognition is far more complex, looking at the individual components of each word such as the number of vowels.
It relies on a system being able to identify a word from its audio footprint, a set of sounds called an utterance.
Sound waves are converted into a spectrogram, a graph that shows how the sound changes over time. Each of the approximately 46 phonemes in the English language has its own signature and, when these signatures are strung together in various orders, they form (in theory) a word. This is known as the beads-on-a-string model.
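To make the spectrogram idea concrete, here is a minimal sketch that slices a signal into short frames and measures which frequencies are present in each one. The signal is an invented pair of pure tones, not speech, and the function names are illustrative only. Real recognisers use faster and more refined analysis, but the principle of tracking frequency content over time is the same.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Strength of each frequency bin in one short slice of audio."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, frame_size):
    """Split the signal into frames and analyse each frame in turn."""
    return [
        dft_magnitudes(samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


# Made-up signal: a low tone (2 cycles per frame), then a higher one (6 cycles).
FRAME = 32
low = [math.sin(2 * math.pi * 2 * t / FRAME) for t in range(FRAME)]
high = [math.sin(2 * math.pi * 6 * t / FRAME) for t in range(FRAME)]
spec = spectrogram(low + high, FRAME)

# The strongest bin in each frame shows how the sound changes over time.
dominant = [frame.index(max(frame)) for frame in spec]
print(dominant)  # [2, 6]
```

Each row of `spec` is one column of the graph: a snapshot of the sound at one moment. A phoneme's signature would be a characteristic pattern across several such snapshots.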
For homophones, though, the system falls apart. Saying “read” (as in “I read a book”) and “red” will give the same result. In order to overcome this, we need context.
Language modelling and statistical analysis
This type of voice recognition is found in mobile devices and speech recognition software.
In English, adjectives generally come before a noun rather than the other way round (big truck versus truck big). Some words also tend to precede others: “for”, “good” or “an” often come directly in front of “example”, and a noun is not usually repeated straight after itself. Patterns like these are known as the language model.
If a computer isn’t sure of one of the words, it employs mathematical models and probability – looking at the words before and after it, for instance – to make an educated guess.
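The educated guess can be sketched with a toy language model that counts how often one word follows another. The counts below are invented for illustration, and the scoring rule (multiplying the counts for the word pairs on either side, after adding 1 so unseen pairs do not wipe out the score) is just one simple choice, not the method of any particular product.

```python
# Hypothetical counts of how often the second word follows the first.
BIGRAM_COUNTS = {
    ("i", "read"): 40,
    ("i", "red"): 1,
    ("the", "red"): 30,
    ("the", "read"): 1,
    ("read", "a"): 25,
    ("red", "a"): 1,
    ("red", "truck"): 20,
    ("read", "truck"): 1,
}


def score(previous, candidate, following):
    """Score a candidate using the word before it and the word after it."""
    before = BIGRAM_COUNTS.get((previous, candidate), 0) + 1
    after = BIGRAM_COUNTS.get((candidate, following), 0) + 1
    return before * after


def best_guess(previous, candidates, following):
    """Pick the candidate that fits its neighbours best."""
    return max(candidates, key=lambda word: score(previous, word, following))


print(best_guess("i", ["read", "red"], "a"))        # read
print(best_guess("the", ["read", "red"], "truck"))  # red
```

The two words sound the same, so the audio alone cannot separate them. The surrounding words tip the balance: “I ... a” favours “read”, while “the ... truck” favours “red”.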
The future of voice recognition
At the moment, speech-to-text or speech-to-command is all voice recognition can do, and even then only some of the time.
One idea for improving this is to build artificial neural networks, computers that use millions of electronic nodes and, much like a brain, work by activating different pathways.
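For a feel of what “activating different pathways” means, the sketch below models a single layer of two nodes. The weights are hand-picked and hypothetical; a real network would have vastly more nodes and would learn its weights from recorded speech rather than having them written in by hand.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0 (silent) to 1 (fully active)."""
    return 1 / (1 + math.exp(-x))


def layer(inputs, weights, biases):
    """Each node sums its weighted inputs and activates to some degree."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


# Hand-picked, hypothetical weights: one row per node.
WEIGHTS = [[4.0, -4.0], [-4.0, 4.0]]
BIASES = [0.0, 0.0]

print(layer([1.0, 0.0], WEIGHTS, BIASES))  # first node strongly active
print(layer([0.0, 1.0], WEIGHTS, BIASES))  # second node strongly active
```

Different inputs light up different nodes, and stacking many such layers lets the pattern of activity stand for increasingly complex features of a sound.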
Jake Port contributes to the Cosmos explainer series.