Gemini Robot

At I/O 2024, Google showcased the multimodal capabilities of its Gemini 1.5 AI model. The model can take photos, videos, and audio, along with text, as inputs, process that information, and generate responses.

The company’s AI unit is now working to leverage this capability to train robots to navigate their surroundings. According to recent reports, Google DeepMind has published research detailing how a robot can be trained to understand multimodal instructions, including natural language and images, and perform useful navigation.

Google stated, “To achieve this, we study a widely useful category of navigation tasks we call Multimodal Instruction Navigation with demonstration Tours (MINT), in which the environment prior is provided through a previously recorded demonstration video,” adding that Vision Language Models (VLMs) have shown a promising path toward achieving this goal.

How Google trains a robot on the Gemini 1.5 AI model

According to Google DeepMind, limited context length makes it challenging for many AI models to recall environments; however, Gemini 1.5 Pro’s 1 million token context window has helped the company train its robots for navigation.

“Robots can use human instructions, video tours, and common sense reasoning to successfully find their way around a space,” the researchers said.
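In practice, this means the entire demonstration tour and the user’s request can be handed to the model in a single prompt. Below is a minimal sketch of that idea in Python; `sample_tour_frames` and `query_vlm` are hypothetical placeholders standing in for video pre-processing and the actual Gemini multimodal API, which the article does not detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """A single image sampled from the recorded demonstration tour."""
    index: int
    image_bytes: bytes
    caption: str = ""  # optional human annotation, e.g. "canteen entrance"


def build_navigation_prompt(tour_frames: List[Frame], instruction: str) -> list:
    """Assemble one long multimodal prompt: every tour frame plus the user's request.

    A 1M-token context window is what makes it feasible to include the whole
    tour rather than only a handful of keyframes.
    """
    parts = ["You have been given a video tour of a building, frame by frame."]
    for frame in tour_frames:
        parts.append({"image": frame.image_bytes, "note": frame.caption})
    parts.append(f"Instruction: {instruction}")
    parts.append("Reply with the tour frame index closest to the requested goal.")
    return parts


# Hypothetical usage -- query_vlm stands in for a real multimodal model call.
# prompt = build_navigation_prompt(sample_tour_frames("office_tour.mp4"),
#                                  "Take me to the Finance Department")
# goal_frame_index = query_vlm(prompt)
```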

The trainers take the robots on a tour of specific areas so they understand the real-world setting; the process highlights key places to recall and makes the robots aware of their surrounding environment.

Consider an office environment as an example: the robot is guided around the premises and learns key locations, such as the way to the Chairman’s office, the canteen, or the recreation area.


If a guest or outsider visits the company, they can ask the robot for directions by simply saying, “Take me to the Finance Department,” and the robot will lead the way. Similarly, new employees can ask the robot to guide them from their desks to the canteen.

“The system’s architecture takes in these inputs and then creates a topological graph – or a simplified representation of a space. This is constructed from frames within tour videos, which captures the general connectivity of their surroundings to find a path without a map,” said Google DeepMind.
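As a rough illustration of that idea, the sketch below builds a toy topological graph whose nodes are tour frame indices, connects frames that were recorded consecutively (plus any places the tour revisits), and finds a route with a breadth-first search. The frame indices, the `revisits` shortcut list, and the search are assumptions for illustration; the article does not describe DeepMind’s actual graph construction.

```python
from collections import defaultdict, deque


def build_topological_graph(num_frames: int, revisits: list[tuple[int, int]]) -> dict:
    """Nodes are tour frame indices; edges capture connectivity, not metric geometry.

    Consecutive frames are trivially connected; `revisits` lists pairs of frames
    that show the same place (e.g. the tour passed a hallway junction twice),
    which adds shortcuts to the graph.
    """
    graph = defaultdict(set)
    for i in range(num_frames - 1):
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    for a, b in revisits:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def find_path(graph: dict, start: int, goal: int) -> list[int]:
    """Breadth-first search over the topological graph -- no metric map required."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return []


# Example: a 10-frame tour where frames 2 and 8 show the same hallway junction.
graph = build_topological_graph(10, revisits=[(2, 8)])
print(find_path(graph, start=0, goal=9))  # [0, 1, 2, 8, 9]
```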

According to Google, the company evaluated the system in an office and in a home-like environment, achieving success rates of 86% and 90%, respectively.

“In the future, users could simply record a tour of the environment with a smartphone for their personal robot assistant to understand and navigate,” the company added.


By Aisha Singh

