In a game-changing development in robotics, Google DeepMind’s new artificial intelligence model, RT-2 (Robotics Transformer 2), seamlessly combines vision, language, and action to help robots understand and perform tasks with greater adaptability.
RT-2 is a Vision-Language-Action (VLA) model, unprecedented in its capacity to integrate text and images from the internet and use that acquired knowledge to guide robotic behavior. Acting as a sort of ‘interpreter’, RT-2 can ‘speak robot’: it translates vision and language cues into physical actions.
For example, the Google researchers came up with the idea of representing robot actions as plain text strings such as "1 128 91 241 5 101 127 217", where the numbers encode concrete behavior such as a positional or rotational change of the real robot:

And because actions are represented as strings, they can be seen as just another language to learn. Roughly speaking, training a robot in the real world has become a problem that a large language model (LLM) can solve, and we can use digital data from the web to do so instead of learning from real-world data only!
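To make the idea concrete, here is a minimal Python sketch of this kind of action tokenization: each dimension of a continuous robot action (a terminate flag, positional and rotational deltas of the end effector, and a gripper command) is mapped into one of 256 bins and written out as plain text. The eight-part layout and the 256 bins follow the RT-2 paper, but the value ranges and function names below are my own illustrative assumptions, not DeepMind's code:

```python
import math

N_BINS = 256

# Assumed (low, high) range for each of the 8 action dimensions. The 8-part
# layout (terminate flag, positional deltas, rotational deltas, gripper)
# follows the RT-2 paper; the numeric ranges are illustrative guesses.
RANGES = [
    (0.0, 1.0),              # terminate flag: 0 = keep going, 1 = stop
    (-0.1, 0.1),             # delta x of the end effector (m) -- assumed
    (-0.1, 0.1),             # delta y (m)                     -- assumed
    (-0.1, 0.1),             # delta z (m)                     -- assumed
    (-math.pi, math.pi),     # delta roll (rad)                -- assumed
    (-math.pi, math.pi),     # delta pitch (rad)               -- assumed
    (-math.pi, math.pi),     # delta yaw (rad)                 -- assumed
    (0.0, 1.0),              # gripper opening: 0 = closed, 1 = open
]

def encode_action(action):
    """Turn a continuous 8-dim action into a space-separated token string."""
    tokens = []
    for value, (low, high) in zip(action, RANGES):
        frac = (value - low) / (high - low)        # normalize to [0, 1]
        bin_id = round(frac * (N_BINS - 1))        # discretize into 256 bins
        tokens.append(str(min(max(bin_id, 0), N_BINS - 1)))
    return " ".join(tokens)

def decode_action(token_string):
    """Turn a token string back into (approximate) continuous values."""
    values = []
    for token, (low, high) in zip(token_string.split(), RANGES):
        frac = int(token) / (N_BINS - 1)
        values.append(low + frac * (high - low))
    return values

action = [0.0, 0.02, -0.01, 0.05, 0.1, -0.2, 0.3, 1.0]
print(encode_action(action))                  # a string of 8 integers in 0..255
print(decode_action(encode_action(action)))   # roughly the original values
```

Because the action ends up as ordinary text, it can sit right next to English words and image tokens in the same training sequence, which is exactly what lets a web-scale model pick it up.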
If you forget everything else from this article, I want you to take away this one idea:
You can now train robots on YouTube! 🤯

Here’s an example of Google’s robot following a simple prompt on a task that requires semantic understanding far beyond the data the robot was trained on:

Here’s another task the robot solved in the real world, one that requires an understanding of the world gained purely from web training data:

Fascinating, isn’t it?
Teaching robots to handle intricate, abstract tasks in changing environments has traditionally demanded massive amounts of data and laborious training across all conceivable scenarios. Now, RT-2 represents a quantum leap forward, bypassing this exhaustive training process through knowledge transfer from extensive web data.

Earlier advancements allowed robots to reason, break down multi-step problems, and learn from each other using models like the PaLM-E vision-language model and the RT-1 robotics transformer. RT-2 takes this a step further by consolidating reasoning and action into a single model. It eliminates the need for explicit step-by-step training by drawing on its language and vision training data to perform even tasks it was never trained on.
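To picture what “reasoning and action in one model” means at run time, here is a conceptual Python sketch of the control loop: the same model that reads the camera image and the English instruction emits the next action as a token string, which is decoded (reusing decode_action from the sketch above) and sent to the robot. The model and robot classes are stand-in stubs invented for this example, not DeepMind's API:

```python
# Conceptual sketch of where a VLA model sits in the robot's control loop.
# The model and robot below are stand-in stubs so the snippet runs end to
# end; they are NOT DeepMind's actual code or API.

class StubVLAModel:
    """Stand-in for an RT-2-style model: (image, instruction) -> action tokens."""
    def generate(self, image, prompt):
        # A real model would run a large vision-language transformer here.
        # We return a fixed string in the same 8-token format as above.
        return "0 153 115 191 132 119 140 255"

class StubRobot:
    """Stand-in for the low-level robot controller."""
    def apply_end_effector_delta(self, deltas):
        print("move end effector by", deltas)
    def set_gripper(self, opening):
        print("set gripper opening to", opening)

model, robot = StubVLAModel(), StubRobot()
instruction = "pick up the empty chip bag and throw it away"
image = None  # a real loop would grab the current camera frame here

for _ in range(5):                        # cap the loop for this sketch
    token_string = model.generate(image=image, prompt=instruction)
    action = decode_action(token_string)  # decoder from the earlier sketch
    if action[0] > 0.5:                   # first token is the terminate flag
        break
    robot.apply_end_effector_delta(action[1:7])
    robot.set_gripper(action[7])
```

The point of the sketch is the flow, not the details: perception, language understanding, and action generation all happen inside a single forward pass of one model.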
For instance, without any explicit training to identify or dispose of trash, RT-2 can identify and throw away garbage thanks to its ability to draw on a vast corpus of web data. It can handle the abstract idea that objects like a banana peel or an empty chip bag become trash after use.
Crucially, RT-2 has shown a marked improvement in robots’ adaptability to novel situations. Across 6,000 robotic trials, the model performed on par with its predecessor, RT-1, on trained tasks. On novel, unseen scenarios, however, its success rate nearly doubled, from RT-1’s 32% to 62%, indicating a significant leap in robots’ ability to transfer learned concepts to new situations.
While the journey towards ubiquitous, helpful robots in human-centric environments is still ongoing, the arrival of RT-2 represents a thrilling leap forward. It demonstrates how rapid advancements in AI are accelerating progress in robotics, opening up exciting possibilities for developing more versatile, general-purpose robots.
You can read the full paper here: https://robotics-transformer2.github.io/assets/rt2.pdf
Also check out our Finxter article on embodied AI:
💡 Recommended: Cross-Species Cooperation: Uniting Humans and Embodied AI through English Language
