In December 2020, Google announced ‘Look To Speak’, an application designed to help people with disabilities speak through their eyes. Two years on, building on what was learned, this technology continues to advance with the aim of becoming an integrated element of products with Google Assistant.
The objective? No longer needing to say the awkward ‘Ok Google’ to activate Google Assistant: instead, it activates automatically when it catches our eye. Google has explained how this technology works and the challenges it has faced.
The Assistant listens to you, but only if it is “watching you”
Although ‘Ok Google’ remains one of the cornerstones of activating Google Assistant, Google’s technical explanation of ‘Look to Speak’ starts off strong: “In natural conversations, we don’t say people’s names every time we address them”.
Google wants Assistant’s interactions to be as human-like and natural as possible, including being able to start talking to you when you make eye contact. To achieve this, it announced ‘Look to Speak’ at Google I/O 2022, explaining now that this is the first time a device simultaneously analyzes audio, video and text.
Creating the model was not as simple as activating Google Assistant whenever we look at the Google Nest Hub, the device that carries this technology. The function only activates if the model detects that we actually want to interact with it. To decide, it evaluates signals such as the subject’s distance from the device, head orientation and gaze direction, determining whether they indicate an intent to start a conversation.
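As a rough illustration of this kind of gating, the visual cues the article lists (distance, head orientation, gaze direction) can be combined into a simple yes/no decision. This is a minimal sketch with invented names and thresholds, not Google's actual model:

```python
from dataclasses import dataclass

@dataclass
class VisualSignals:
    """Hypothetical per-frame visual cues mentioned in the article."""
    distance_m: float        # estimated distance of the subject from the device
    head_yaw_deg: float      # head orientation relative to the screen
    gaze_on_device: float    # confidence (0-1) that the gaze is on the device

def wants_to_interact(s: VisualSignals,
                      max_distance_m: float = 1.5,
                      max_yaw_deg: float = 30.0,
                      min_gaze_conf: float = 0.8) -> bool:
    """Gate activation: every cue must suggest an intended interaction.
    Thresholds here are illustrative, not Google's."""
    return (s.distance_m <= max_distance_m
            and abs(s.head_yaw_deg) <= max_yaw_deg
            and s.gaze_on_device >= min_gaze_conf)
```

A glance from across the room (large distance, head turned away) would fail the gate, which is the behaviour the article describes.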
For these analyses, the model processes frames of both the video and audio input to predict whether the user is talking to the device or interacting with their home environment (if, for example, we are talking to someone at home, Assistant detection should not be triggered). The audio input is tied to Google’s Voice Match, so the Assistant won’t interact with anyone whose voice it doesn’t recognize.
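The per-frame, two-stream decision described above can be sketched as follows. The speaker check stands in for Voice Match; the scores, names and thresholds are assumptions for illustration only:

```python
ENROLLED_SPEAKERS = {"alice", "bob"}  # hypothetical Voice Match enrollments

def should_respond(video_scores: list[float],
                   audio_scores: list[float],
                   speaker_id: str) -> bool:
    """Decide whether the Assistant should engage.

    video_scores / audio_scores: per-frame confidences (0-1) that the user
    is addressing the device rather than someone else in the room.
    """
    # Voice Match behaviour: ignore anyone whose voice is not recognized.
    if speaker_id not in ENROLLED_SPEAKERS:
        return False
    # Require both modalities to agree, averaged over the frame window.
    video_avg = sum(video_scores) / len(video_scores)
    audio_avg = sum(audio_scores) / len(audio_scores)
    return video_avg > 0.7 and audio_avg > 0.7
```

Requiring both streams to agree is what keeps a conversation with someone else at home (high audio activity, but gaze and framing elsewhere) from triggering the Assistant.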
Especially interesting on the audio side is that the model detects whether we are trying to query the Assistant by analyzing non-lexical information. In other words, the tone of voice, the speaking speed and some contextual signals are analyzed to understand whether or not we want to make a query.
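Scoring these non-lexical cues might look something like the toy function below. The features, weights and thresholds are all invented for the sake of the example; they are not from Google's description:

```python
def query_intent_score(pitch_variation: float,
                       speaking_rate_wps: float,
                       facing_device: bool) -> float:
    """Combine non-lexical cues into a rough 'addressing the Assistant' score.

    pitch_variation: normalized variability of the voice's tone (0-1)
    speaking_rate_wps: speaking speed in words per second
    facing_device: a contextual signal from the video stream
    """
    score = 0.0
    if pitch_variation > 0.2:            # a directive, engaged tone
        score += 0.4
    if 1.0 <= speaking_rate_wps <= 4.0:  # deliberate, conversational pace
        score += 0.3
    if facing_device:                    # contextual signal agrees
        score += 0.3
    return score
```

The point is that no words are inspected at all: how something is said, plus context, is enough to estimate whether a query is intended.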
For now, ‘Look to Speak’ is reserved for the Nest Hub, but Google does not rule out bringing it to more of its devices.