We recently covered how many people talk to their smartphones, and part of this was Google asking what we, the users, want voice commands to be able to do for us. Voice recognition research began over sixty years ago, and the earliest systems were crude and slow, able to recognize only numbers. By the seventies, voice recognition systems could understand a thousand words, and today voice recognition has reached the point where there is limited context awareness. For example, if I ask Google, “who is the British Prime Minister?” Google will tell me. If I then ask, “how old is he?” it will tell me his age without asking who I mean.

Much of this improvement comes from cloud computing: raw speech data is sent to Google’s server farm, where it is broken down and processed, and the instructions are returned to our device. This process requires enormous computing power for what is still a very crude system. Google has released a seven-minute video clip entitled “Behind the Mic: The Science of Talking with Computers” to help explain part of the problem, which you can see below:
One of Google’s breakthroughs was the creation of a crude “neural network,” which is used in speech recognition. You see, the problem with speech recognition is twofold. First, the computer must recognize what was said: we each speak our language(s) slightly differently from the next person, with subtle differences of tone and inflection and with different accents. But simply knowing which words were spoken is not enough. The second part is understanding what we meant: we may color our words with sarcasm, irony or happiness, our speech rhythm varies, and we give off visual clues such as hand movements, shrugs and facial expressions. The answer may lie in replicating, in some small way, how the human brain works, using a neural network or massive arrays of parallel processors to understand what has been said in context. The approach Google demonstrates in the video clip breaks words down into a mathematical pattern, which plays to the computer’s strength: the ability to process numerical data very quickly.
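To make the “mathematical pattern” idea concrete, here is a minimal sketch in plain Python: a crude discrete Fourier transform turns raw audio samples into a handful of frequency-band energies (the kind of numeric pattern a recognizer actually compares), and a single artificial neuron shows the weighted-sum arithmetic a neural network is built from. The function names, bin count and weights are illustrative assumptions, not Google’s actual pipeline.

```python
import math

def freq_energies(samples, bins=4):
    """Crude DFT: energy in a few frequency bands -- the 'mathematical
    pattern' a recognizer works with instead of raw audio."""
    n = len(samples)
    energies = []
    for k in range(1, bins + 1):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        im = sum(-s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(samples))
        energies.append(math.hypot(re, im) / n)  # magnitude of band k
    return energies

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum squashed by a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

# Synthesize a pure tone with 2 cycles per window; its energy should
# land in the second frequency band (index 1).
n = 64
samples = [math.sin(2 * math.pi * 2 * i / n) for i in range(n)]
pattern = freq_energies(samples)
strongest = max(range(len(pattern)), key=lambda k: pattern[k])
print(strongest)            # -> 1
print(neuron(pattern, [0.5, 0.5, 0.5, 0.5], -0.1))  # one neuron's activation
```

Real systems work on overlapping slices of audio with far finer frequency resolution, and the network has millions of learned weights rather than a single hand-set neuron, but the principle is the same: speech becomes arrays of numbers, and recognizing it becomes fast arithmetic.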
Google is forecasting great strides in the next five years. We can already do a lot with our devices without training them, but there is still much we can’t do, and we know Google is aiming for the Star Trek voice-control experience. Even if we do reach that point in five years, I suspect we will still feel awkward talking to our smart devices!