At I/O 2024, Google showcased the multimodal capabilities of its Gemini 1.5 AI model. The model can take photos, videos, and audio, along with text, as inputs, process that information, and generate responses.
The company’s AI unit is now working to leverage this capability to train robots to navigate their surroundings. According to recent reports, Google DeepMind has published research detailing how a robot can be trained to understand multimodal instructions, including natural language and images, and to perform useful navigation.
Google stated, “To achieve this, we study a widely useful category of navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video.” The company added that Vision Language Models (VLMs) have shown a promising path toward achieving this goal.
How Google trains a robot on the Gemini 1.5 AI model
According to Google DeepMind, limited context length makes it challenging for many AI models to recall environments; however, Gemini 1.5 Pro’s 1 million token context length has helped the company train its robots for navigation.
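As a rough illustration of why a long context window matters here, the sketch below shows how an entire demonstration tour video plus a navigation instruction could fit in a single prompt using Google’s google-generativeai Python SDK. This is not DeepMind’s actual robot pipeline; the file name, prompt wording, and the idea of prompting the model directly are assumptions made for illustration.

```python
import time
import google.generativeai as genai

# Configure the SDK with an API key (assumed to be provided by the reader).
genai.configure(api_key="YOUR_API_KEY")

# Upload a previously recorded demonstration tour of the environment.
# "office_tour.mp4" is a hypothetical file name used for illustration.
tour_video = genai.upload_file(path="office_tour.mp4")

# Wait for the uploaded video to finish processing before using it.
while tour_video.state.name == "PROCESSING":
    time.sleep(5)
    tour_video = genai.get_file(name=tour_video.name)

model = genai.GenerativeModel("gemini-1.5-pro")

# Because Gemini 1.5 Pro supports a 1 million token context window,
# the whole tour video and the user's instruction can go in one prompt.
response = model.generate_content([
    tour_video,
    "You are a navigation assistant for a robot. Using the tour above, "
    "describe the route from the entrance to the Finance Department.",
])

print(response.text)
```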
“Robots can use human instructions, video tours, and common sense reasoning to successfully find their way around a space,” the trainers said.
The trainers take the robots on a tour of specific areas so they can understand the real-world setting; during the tour, key places are highlighted for the robots to recall, making them aware of the surrounding environment.
Consider an office environment as an example: the robot is guided through the premises and learns key locations, such as the way to the Chairman’s office, the canteen, or the recreation area.
If a guest or outside visitor comes to the company, they can ask the robot to guide them to their destination; a simple command like “Take me to the Finance Department” prompts the robot to lead the way. Similarly, new employees can ask the robot to guide them from their desks to the canteen.
“The system’s architecture takes in these inputs and then creates a topological graph – or a simplified representation of a space. This is constructed from frames within tour videos, which captures the general connectivity of their surroundings to find a path without a map,” said Google DeepMind.
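To make the idea of a topological graph more concrete, here is a minimal sketch, not DeepMind’s implementation, of how sampled tour frames could become graph nodes, with edges between consecutive frames and between visually similar frames, so that a route between two landmarks can be found without a metric map. The embedding vectors and the similarity threshold are assumptions for illustration.

```python
import numpy as np
import networkx as nx

def build_topological_graph(frame_embeddings, similarity_threshold=0.9):
    """Build a simplified topological graph from tour-video frames.

    frame_embeddings: list of unit-normalised feature vectors, one per
    sampled frame (how the embeddings are produced is assumed here).
    """
    graph = nx.Graph()
    graph.add_nodes_from(range(len(frame_embeddings)))

    # Consecutive frames in the tour are physically adjacent, so connect them.
    for i in range(len(frame_embeddings) - 1):
        graph.add_edge(i, i + 1)

    # Connect non-consecutive frames that look alike: the tour likely
    # revisited the same place, so those viewpoints are also adjacent.
    for i in range(len(frame_embeddings)):
        for j in range(i + 2, len(frame_embeddings)):
            similarity = float(np.dot(frame_embeddings[i], frame_embeddings[j]))
            if similarity >= similarity_threshold:
                graph.add_edge(i, j)

    return graph

def plan_route(graph, start_frame, goal_frame):
    # A route is just a sequence of frame indices; no metric map is needed.
    return nx.shortest_path(graph, source=start_frame, target=goal_frame)
```

In the reported system, the goal location would be identified from the user’s multimodal instruction; in this sketch it is simply passed in as a frame index.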
According to Google, the company evaluated the system in an office and in a home-like environment, achieving success rates of 86% and 90%, respectively.
“In the future, users could simply record a tour of the environment with a smartphone for their personal robot assistant to understand and navigate,” the company added.