Voice recognition is becoming an integral part of today's mobile devices: we use it to activate functions, it turns speech into text that can be sent as a message, and digital assistants rely on it constantly to answer questions. In 2012, Google replaced the Gaussian Mixture Model (GMM), which had been in use for over 30 years, with Deep Neural Networks (DNNs), which were better at identifying the sounds a user produces at any given moment and improved the accuracy of speech recognition.
Now, Google has announced that it is switching to new models built with Connectionist Temporal Classification (CTC) and sequence discriminative training techniques. These models are extensions of a type of neural network called recurrent neural networks (RNNs), and they provide more accurate results, particularly when there is background noise, while also making voice recognition faster. The improved RNNs can capture how a word is spoken over time, and they memorize information better than other systems. The CTC models recognize phonemes without having to make a prediction at every instant: they work on larger audio chunks, so fewer computations are needed and recognition is faster. To improve performance in noisy environments, artificial noise was added to the training data.
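To make the idea of training with artificial noise more concrete, here is a minimal Python sketch that mixes a noise signal into a clean recording at a chosen signal-to-noise ratio (SNR). This is only an illustration, not Google's actual pipeline: the function names `mean_power` and `mix_at_snr`, the toy signals, and the 10 dB target are all assumptions made for this example.

```python
import math
import random


def mean_power(samples):
    """Average squared amplitude of a list of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus noise, with the noise scaled to the requested SNR (in dB)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        return list(speech)  # silent noise: nothing to mix in
    # SNR(dB) = 10 * log10(P_speech / P_noise), solved here for the noise power we want.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Hypothetical toy signals: a 440 Hz tone stands in for speech, Gaussian noise for background.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

A lower `snr_db` makes the noise louder relative to the speech, so a training set could in principle include the same utterance at several noise levels.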
One problem emerged along the way: the model was recognizing phonemes with a delay of about 300 milliseconds, so it had to be trained to output its phoneme predictions closer to the moment they are spoken. The new models are now integrated into the Google app for Android and iOS, and dictation with the new models is available on Android devices. In the video below, you can see how the RNNs learn to recognize the phrase “How cold is it outside”. The phonemes are represented in colors, and each one creates a spike that the CTC model can recognize. At first, the model seems to respond to all sorts of audio input; by the end of the video, each phoneme's representation is separated and aligned where it belongs.
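For readers curious about how a sequence of spikes becomes a sequence of phonemes, the following Python sketch shows the simplest form of CTC decoding: pick the most likely label in each frame, merge consecutive repeats, and drop the special "blank" label that CTC outputs between spikes. This is a simplified greedy decoder under assumed inputs, not the decoder Google uses; the phoneme labels, the scores, and the name `ctc_greedy_decode` are hypothetical.

```python
BLANK = "_"  # CTC's "no phoneme here" label (the symbol itself is an arbitrary choice)


def ctc_greedy_decode(frames, blank=BLANK):
    """Collapse per-frame label scores into a phoneme sequence.

    frames: list of dicts mapping label -> score, one dict per time step.
    """
    # Step 1: take the highest-scoring label in every frame.
    best_path = [max(frame, key=frame.get) for frame in frames]
    # Step 2: merge consecutive repeats, then drop blanks.
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Hypothetical scores for the start of "how": mostly blanks, with short phoneme spikes.
frames = [
    {"_": 0.9, "HH": 0.1, "AW": 0.0},
    {"_": 0.2, "HH": 0.7, "AW": 0.1},
    {"_": 0.3, "HH": 0.6, "AW": 0.1},
    {"_": 0.8, "HH": 0.1, "AW": 0.1},
    {"_": 0.1, "HH": 0.1, "AW": 0.8},
    {"_": 0.9, "HH": 0.0, "AW": 0.1},
]
print(ctc_greedy_decode(frames))  # ['HH', 'AW']
```

Because a blank separates them, two spikes of the same phoneme are kept as two phonemes, while a single spike that lasts several frames is counted only once.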