In a game-changing development in robotics, Google DeepMind’s new artificial intelligence model, RT-2 (Robotics Transformer 2), seamlessly combines vision, language, and action to help robots understand and perform tasks with greater adaptability.
RT-2 is a Vision-Language-Action (VLA) model, unprecedented in its capacity to integrate text and images from the internet and use that acquired knowledge to guide robotic behavior. Acting as a sort of ‘interpreter’, RT-2 can ‘speak robot’: it translates vision and language cues into physical actions.
For example, the Google researchers came up with the idea of representing robot actions as plain text strings such as "1 128 91 241 5 101 127 217", where the numbers encode concrete behavior such as a positional or rotational change of the real robot:

And because actions are represented as strings, they can be seen as just another language to learn. Roughly speaking, training a robot in the real world has become a problem that a large language model (LLM) can solve, and we can use digital data from the web to do so instead of learning from real-world data only!
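To make the idea concrete, here is a minimal Python sketch of this kind of action tokenization: each dimension of a continuous robot action (a terminate flag, positional and rotational deltas of the end effector, and a gripper command) is mapped into one of 256 bins and written out as plain text. The eight-part layout and the 256 bins follow the RT-2 paper, but the value ranges and function names below are my own illustrative assumptions, not DeepMind's code:

```python
import math

N_BINS = 256

# Assumed (low, high) range for each of the 8 action dimensions. The 8-part
# layout (terminate flag, positional deltas, rotational deltas, gripper)
# follows the RT-2 paper; the numeric ranges are illustrative guesses.
RANGES = [
    (0.0, 1.0),              # terminate flag: 0 = keep going, 1 = stop
    (-0.1, 0.1),             # delta x of the end effector (m) -- assumed
    (-0.1, 0.1),             # delta y (m)                     -- assumed
    (-0.1, 0.1),             # delta z (m)                     -- assumed
    (-math.pi, math.pi),     # delta roll (rad)                -- assumed
    (-math.pi, math.pi),     # delta pitch (rad)               -- assumed
    (-math.pi, math.pi),     # delta yaw (rad)                 -- assumed
    (0.0, 1.0),              # gripper opening: 0 = closed, 1 = open
]

def encode_action(action):
    """Turn a continuous 8-dim action into a space-separated token string."""
    tokens = []
    for value, (low, high) in zip(action, RANGES):
        frac = (value - low) / (high - low)        # normalize to [0, 1]
        bin_id = round(frac * (N_BINS - 1))        # discretize into 256 bins
        tokens.append(str(min(max(bin_id, 0), N_BINS - 1)))
    return " ".join(tokens)

def decode_action(token_string):
    """Turn a token string back into (approximate) continuous values."""
    values = []
    for token, (low, high) in zip(token_string.split(), RANGES):
        frac = int(token) / (N_BINS - 1)
        values.append(low + frac * (high - low))
    return values

action = [0.0, 0.02, -0.01, 0.05, 0.1, -0.2, 0.3, 1.0]
print(encode_action(action))                  # a string of 8 integers in 0..255
print(decode_action(encode_action(action)))   # roughly the original values
```

Because the action ends up as ordinary text, it can sit right next to English words and image tokens in the same training sequence, which is exactly what lets a web-scale model pick it up.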
If you forget everything else from this article, I want you to take away this one idea:
You can now train robots on YouTube! 🤯

Here’s an example of Google’s robot following a simple prompt on a task that requires semantic understanding far beyond the data the robot was trained on:

Here’s another task the robot solved in the real world, one that requires an understanding of the world gained purely from web training data:

Fascinating, isn’t it?
Teaching robots to handle intricate, abstract tasks in changing environments has traditionally demanded massive amounts of data and laborious training across all conceivable scenarios. Now, RT-2 represents a quantum leap forward, bypassing this exhaustive training process through knowledge transfer from extensive web data.

Earlier advancements allowed robots to reason, break down multi-step problems, and learn from each other using models like the PaLM-E vision-language model and the RT-1 robotics transformer. RT-2 takes this a step further by consolidating reasoning and action into a single model. It eliminates the need for explicit step-by-step training by drawing on its language and vision training data to perform even tasks it was never trained on.
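To picture what “reasoning and action in one model” means at run time, here is a conceptual Python sketch of the control loop: the same model that reads the camera image and the English instruction emits the next action as a token string, which is decoded (reusing decode_action from the sketch above) and sent to the robot. The model and robot classes are stand-in stubs invented for this example, not DeepMind's API:

```python
# Conceptual sketch of where a VLA model sits in the robot's control loop.
# The model and robot below are stand-in stubs so the snippet runs end to
# end; they are NOT DeepMind's actual code or API.

class StubVLAModel:
    """Stand-in for an RT-2-style model: (image, instruction) -> action tokens."""
    def generate(self, image, prompt):
        # A real model would run a large vision-language transformer here.
        # We return a fixed string in the same 8-token format as above.
        return "0 153 115 191 132 119 140 255"

class StubRobot:
    """Stand-in for the low-level robot controller."""
    def apply_end_effector_delta(self, deltas):
        print("move end effector by", deltas)
    def set_gripper(self, opening):
        print("set gripper opening to", opening)

model, robot = StubVLAModel(), StubRobot()
instruction = "pick up the empty chip bag and throw it away"
image = None  # a real loop would grab the current camera frame here

for _ in range(5):                        # cap the loop for this sketch
    token_string = model.generate(image=image, prompt=instruction)
    action = decode_action(token_string)  # decoder from the earlier sketch
    if action[0] > 0.5:                   # first token is the terminate flag
        break
    robot.apply_end_effector_delta(action[1:7])
    robot.set_gripper(action[7])
```

The point of the sketch is the flow, not the details: perception, language understanding, and action generation all happen inside a single forward pass of one model.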
For instance, without any explicit training to identify or dispose of trash, RT-2 can identify and throw away garbage thanks to its ability to draw on a vast corpus of web data. It can handle the abstract idea that objects like a banana peel or an empty chip bag become trash after use.
Crucially, RT-2 has shown a marked improvement in robots’ adaptability to novel situations. Across 6,000 robotic trials, the model performed on par with its predecessor, RT-1, on trained tasks. On novel, unseen scenarios, however, its success rate nearly doubled, from RT-1’s 32% to 62%, indicating a significant leap in robots’ ability to transfer learned concepts to new situations.
While the journey towards ubiquitous, helpful robots in human-centric environments is still ongoing, the arrival of RT-2 represents a thrilling leap forward. It demonstrates how rapid advancements in AI are accelerating progress in robotics, opening up exciting possibilities for developing more versatile, general-purpose robots.
You can read the full paper here: https://robotics-transformer2.github.io/assets/rt2.pdf
Also check out our Finxter article on embodied AI:
💡 Recommended: Cross-Species Cooperation: Uniting Humans and Embodied AI through English Language
