Google has come up with a new way to match faces and voices, and has filed a patent on it with the World Intellectual Property Organization. The patent is not just an enhancement to the facial recognition features commonly found in mobile devices and apps; it also covers recognizing and matching multiple faces and voices in a single video. The method focuses on determining when somebody is talking, then matching their voice to their face, potentially letting the machine pick out and understand individual voices in a noisy environment by looking for a face whose movement matches a sound pattern corresponding to a pre-defined voice profile.
The patent details a speaker diarization system that starts by finding faces, then watches those faces while speech is present in the video to determine who is talking. Essentially, the goal is to catch somebody talking alone, then isolate their voice by confirming that the audio matches the movement of their mouth. Once that is done, the voice is positively profiled and filed together with the face, creating a hard match. In crowd scenes or videos with multiple speakers, the procedure is repeated for every person who speaks; once profiles exist for everybody, the system can tell who is speaking and when. It can also understand what they are saying a bit better by reading their lips while transcribing audio, checking that mouth movement matches what the machine's audio processing thinks the person is saying.
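The core idea, matching a stretch of speech to the face whose mouth is moving in step with it, can be sketched in a few lines. This is purely an illustrative toy, not Google's patented method: the function names, the use of a simple audio "envelope", and the per-face "mouth motion" signals are all assumptions for the sake of the example.

```python
# Toy sketch of audio-visual speech attribution: assign a speech
# segment to the face whose mouth motion best tracks the audio.
# All names and signals here are hypothetical, not from the patent.

def correlation(a, b):
    """Pearson correlation between two equal-length signals."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5

def attribute_speech(audio_envelope, mouth_motion_by_face):
    """Return the face whose mouth-openness signal correlates
    best with the audio loudness envelope, plus all scores."""
    scores = {face: correlation(audio_envelope, motion)
              for face, motion in mouth_motion_by_face.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Example: face "A" moves its mouth in sync with the audio, "B" stays idle.
audio = [0.1, 0.9, 0.8, 0.2, 0.9, 0.7, 0.1, 0.8]
faces = {
    "A": [0.0, 1.0, 0.9, 0.1, 1.0, 0.8, 0.0, 0.9],  # in step with the audio
    "B": [0.5, 0.4, 0.5, 0.5, 0.4, 0.5, 0.5, 0.4],  # barely moving
}
speaker, scores = attribute_speech(audio, faces)
print(speaker)  # → "A"
```

A real system would work on detected face crops and spectral audio features rather than hand-made signals, and would then bank the matched voice as a profile, but the matching step reduces to this kind of synchrony test.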
This new method is based on computer vision and machine learning, and has a nearly endless stable of potential use cases. Google's patent filing speaks of automatic captioning in very broad terms, almost certainly a reference to the same feature on YouTube. Google detailed its efforts in this area not long ago and vowed to keep improving; this patent is likely an indication of where those efforts are headed. Better face-based security for devices, isolating a speaker's voice in videos, improved lip syncing for character models in computer-generated media with voice acting, and new voice and facial expression gestures for consumer devices are just a few possible uses for this technology.