Google’s RT-2 Enables Robots To Learn From YouTube Videos

In a game-changing development in robotics, Google DeepMind’s new artificial intelligence model, RT-2 (Robotics Transformer 2), seamlessly combines vision, language, and action to help robots understand and perform tasks with greater adaptability.

🔗 Source:

RT-2 is a Vision-Language-Action (VLA) model, unprecedented in its capacity to integrate text and images from the internet and use the acquired knowledge to dictate robotic behavior. Acting as a sort of ‘interpreter’, RT-2 can ‘speak robot’, translating vision and language cues into physical actions.

For example, the Google researchers represent robot actions as text strings such as "1 128 91 241 5 101 127 217", where each number encodes a concrete behavior such as a positional or rotational change of the real robot:

And because actions are represented as strings, they can be seen as just another language to learn. Roughly speaking, training a robot in the real world has now become a problem that can be solved by a large language model (LLM)! And we can use digital data from the web to do so instead of needing to learn from real-world data only!
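To make this concrete, here is a minimal sketch of how such an action string could be decoded back into continuous robot commands. The layout assumed here (a termination flag followed by seven action dimensions, each discretized into 256 bins) is illustrative and simplified, not the exact specification from the RT-2 paper:

```python
# Hedged sketch: decoding an RT-2-style action string into continuous values.
# Assumed (not from the paper verbatim): token 0 is a termination flag, and
# tokens 1..7 are action dimensions, each a bin index in 0..255 that maps
# linearly back to a continuous value in [low, high].

def decode_action(action_string, low=-1.0, high=1.0):
    tokens = [int(t) for t in action_string.split()]
    terminate, bins = tokens[0], tokens[1:]
    # Map each bin index (0..255) to a continuous value in [low, high].
    scale = (high - low) / 255
    return terminate, [low + b * scale for b in bins]

terminate, action = decode_action("1 128 91 241 5 101 127 217")
print(terminate, [round(a, 3) for a in action])
```

The key point is that the model never needs a special "action head": it just emits these integer tokens like any other text, and a thin decoder like the one above turns them into motor commands.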

If you forget all else from this article, I want you to take this one idea out of it:

You can now train robots on YouTube! 🤯

Here’s an example of how Google’s robot can perform simple prompts on tasks that require semantic understanding that goes far beyond the data the robot was trained on:

Here’s another task the robot solved in the real world, one that requires an understanding of the world obtained through web training data:

Fascinating, isn’t it?

Teaching robots to handle intricate, abstract tasks in changing environments has traditionally demanded massive amounts of data and laborious training across all conceivable scenarios. Now, RT-2 represents a quantum leap forward, bypassing this exhaustive training process through knowledge transfer from extensive web data.

Earlier advancements allowed robots to reason, dissect multi-step problems, and learn from each other using models like the PaLM-E vision model and RT-1 transformer. The introduction of RT-2 takes this a step further, consolidating reasoning and action into one model. It eliminates the need for explicit step-by-step training by using its language and vision training data to perform even untrained tasks.

For instance, without any explicit training to identify or dispose of trash, the RT-2 model can identify and throw away garbage, thanks to its ability to draw on a vast corpus of web data. It can handle the abstract concept of identifying objects like a banana peel or a chip bag as trash after their use.

Crucially, RT-2 has shown a marked improvement in robots’ adaptability to novel situations. Across 6,000 robotic trials, the model performed on par with the previous model, RT-1, on trained tasks. However, its performance on novel, unseen scenarios nearly doubled, from RT-1’s 32% to 62%, indicating a significant leap in robots’ ability to transfer learned concepts to new situations.

While the journey towards ubiquitous, helpful robots in human-centric environments is still ongoing, the arrival of RT-2 represents a thrilling leap forward. It demonstrates how rapid advancements in AI are accelerating progress in robotics, opening up exciting possibilities for developing more versatile, general-purpose robots.

🔗 You can read the full paper here:

Also check out our Finxter article on embodied AI:

💡 Recommended: Cross-Species Cooperation: Uniting Humans and Embodied AI through English Language