Cross-Species Cooperation: Uniting Humans and Embodied AI through English Language

I stumbled upon an interesting new MIT and IBM Watson AI Lab project titled Building Cooperative Embodied Agents Modularly with Large Language Models.

AI Breaks Out Of Your Screen

If you’re like me, you are not deep into AI research, so let’s start with the question: What are embodied agents anyway?

👨‍💻 Definition: Embodied agents represent the thrilling intersection of AI and robotics, where machines with physical forms experience and influence the world directly. They aren’t merely digital constructs—they’re AI-powered entities equipped with sensors and actuators, allowing them to perceive and interact with their environments.

Whether it’s a self-driving car understanding traffic patterns, a drone soaring the skies, or a humanoid robot aiding in industrial processes, embodied agents are revolutionizing our interaction with technology due to their ability to adapt to real-world constraints and possibilities.

Think of it this way: AI breaks out of the screen you’re currently reading this on and enters the real world that’s all around you. 🤖

TLDR: LLM Agents Communicate in Human Language?

This paper explores using Large Language Models (LLMs) such as GPT-4 for multi-agent cooperation on embodied AI tasks. Picture Alice and Bob, two humanoid agents working together to clean your house.

The research introduces a new framework that allows these AI-powered physical entities to plan, communicate, and collaborate with others, including humans, in various embodied environments.

The paper reports that these LLM-powered agents can outperform strong planning-based methods and develop effective communication strategies without fine-tuning or few-shot prompting.

Particularly notable is the finding that these agents, when communicating in natural language, tend to foster more trust and cooperation with humans. This work underscores the promising potential of LLMs in embodied AI and sets the stage for further research into multi-agent cooperation.

I imagine a cyber-biological system where humans and machines work together on meta-tasks using natural English instead of a less inclusive “low-level” programming language.

Demonstration

Several demonstration videos are already published on the official website.

The first video shows Bob and Alice communicating in natural language, with Alice asking Bob whether he can search for the orange in another room. Bob and Alice could be autonomous agents like the Tesla Optimus robot, or an LLM and a human.
The second video shows Alice suggesting that Bob use the containers to collect and transport the apple and banana to the bed (*yummy 😋).
The third video shows Bob and Alice doing some deductive reasoning (the thought bubbles) to infer that Bob’s container is already full, so he should ask Alice for help.
The fourth video shows nicely how the autonomous agents use common sense to search for and move items, working together on the overall task.

Together, this is an impressive demonstration of the usability of large language models (LLMs) for real-world applications. The large body of human text the LLMs have been trained on has helped create a common ground (“common sense”) that is expressed in the English language and valuable for efficient interactions of cyber-biological systems deployed in the real world.

It reminds me of the similarly “naive-looking” planning approach used by digital autonomous agents such as Auto-GPT or BabyAGI.

How Does It Work?

💡 Imagine the future of AI as a well-coordinated ensemble, a seamless blend of observation, belief, communication, reasoning, and planning — this is the central theme of this latest research on LLMs applied to multi-agent cooperation for embodied AI.

The five-module framework operates like a smooth assembly line.

First, the Observation Module processes raw data that the AI receives from its environment, acting as the eyes and ears of the embodied agent.

Once processed, the data is then handed over to the Belief Module, which updates the agent’s internal understanding of its surroundings and other agents, serving as the mind that makes sense of what the agent perceives.
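
To make the hand-off from observation to belief a bit more concrete, here is a minimal Python sketch. It is my own illustration, not code from the paper: the `Observation` and `Belief` classes and all of their fields are hypothetical names I made up, and a real agent would track far more than object locations.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One processed snapshot from the agent's sensors (hypothetical structure)."""
    room: str                    # room the agent is currently in
    visible_objects: list[str]   # objects seen in that room


@dataclass
class Belief:
    """The agent's running picture of the world (hypothetical structure)."""
    object_locations: dict[str, str] = field(default_factory=dict)
    explored_rooms: set[str] = field(default_factory=set)

    def update(self, obs: Observation) -> None:
        # Mark the room as explored and remember where each object was last seen.
        self.explored_rooms.add(obs.room)
        for name in obs.visible_objects:
            self.object_locations[name] = obs.room


belief = Belief()
belief.update(Observation(room="kitchen", visible_objects=["apple", "banana"]))
belief.update(Observation(room="bedroom", visible_objects=["bed"]))
print(belief.object_locations)
```

The point of the sketch is only the direction of data flow: observations come in one at a time, and the belief accumulates them into something the later modules can query.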

This belief, enriched with information from previous actions and dialogues, is used to build prompts for the two LLM-powered components: the Communication and Reasoning Modules. These modules are the agent’s voice and brain, respectively.

The Communication Module uses the LLM to generate messages, enabling effective interaction with other agents and humans.
Meanwhile, the Reasoning Module, also backed by the LLM, deliberates on high-level plans, guiding the agent’s strategic moves.
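
Here is a rough Python sketch of how one shared prompt builder could feed both the Communication and the Reasoning Module. Everything in it is an assumption of mine: the prompt wording, the function names, and especially `fake_llm`, which is a canned stand-in so the example runs without any API key. The paper's actual prompts are different and much richer.

```python
from typing import Callable

LLM = Callable[[str], str]  # any function that maps a prompt to a completion


def build_prompt(goal: str, belief: dict[str, str],
                 dialogue: list[str], instruction: str) -> str:
    """Serialize the agent's state into plain English for the LLM."""
    known = ", ".join(f"{obj} in {room}"
                      for obj, room in sorted(belief.items())) or "nothing yet"
    history = "\n".join(dialogue) or "(no messages so far)"
    return (
        f"Goal: {goal}\n"
        f"Known objects: {known}\n"
        f"Dialogue history:\n{history}\n"
        f"{instruction}"
    )


def communicate(llm: LLM, goal: str, belief: dict[str, str],
                dialogue: list[str]) -> str:
    # The agent's "voice": ask the LLM for a message to the teammate.
    return llm(build_prompt(goal, belief, dialogue,
                            "Write a short message to your teammate."))


def reason(llm: LLM, goal: str, belief: dict[str, str],
           dialogue: list[str]) -> str:
    # The agent's "brain": ask the LLM for the next high-level plan.
    return llm(build_prompt(goal, belief, dialogue,
                            "Choose the next high-level plan."))


def fake_llm(prompt: str) -> str:
    # Canned stand-in for a real model call; it only looks at the instruction line.
    if "message" in prompt.splitlines()[-1]:
        return "Bob, can you search the bedroom for the orange?"
    return "go to kitchen"


goal = "Put the apple and the orange on the bed"
belief = {"apple": "kitchen"}
dialogue = ["Bob: I am in the living room."]
print(communicate(fake_llm, goal, belief, dialogue))
print(reason(fake_llm, goal, belief, dialogue))
```

Swapping `fake_llm` for a function that calls a real model would leave the rest unchanged, which is the nice thing about treating the LLM as a plain text-in, text-out function.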

The final part of this intelligent relay is the Planning Module. Taking the baton from the Reasoning Module, it decides on the immediate, primitive action the agent should take according to the high-level plan. Think of it as the agent’s hands and feet, executing the decisions made.
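
To illustrate the step from a high-level plan to a single primitive action, here is a toy Python sketch. It is an illustration under my own assumptions, not the paper's planner: the floor plan, the "go to ..." plan format, and the action strings are all invented, and I use a simple breadth-first search as a stand-in for whatever low-level planner the authors actually use.

```python
from collections import deque

# Hypothetical floor plan: which rooms connect to which.
ROOMS = {
    "livingroom": ["kitchen", "hallway"],
    "kitchen": ["livingroom"],
    "hallway": ["livingroom", "bedroom"],
    "bedroom": ["hallway"],
}


def shortest_path(start: str, target: str) -> list[str]:
    """Breadth-first search over the room graph; returns rooms from start to target."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in ROOMS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # target unreachable


def next_primitive_action(current_room: str, high_level_plan: str) -> str:
    """Turn a plan such as 'go to bedroom' into the single next step."""
    prefix = "go to "
    if not high_level_plan.startswith(prefix):
        return "wait"  # this sketch only understands navigation plans
    path = shortest_path(current_room, high_level_plan[len(prefix):])
    if len(path) < 2:
        return "wait"  # already there, or no route
    return f"move to {path[1]}"


print(next_primitive_action("kitchen", "go to bedroom"))  # move to livingroom
```

Notice that the LLM never has to think about individual steps here. It only names the destination, and the cheap, deterministic search works out which door to walk through next.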

👨‍💻 Overall, the researchers have ingeniously devised a framework that mimics human-like cooperative behavior, using a combination of LLMs and embodied AI. The modules work together to enable intelligent planning, communication, and cooperation in a multi-agent setting, marking a significant milestone in the evolution of AI.

You can read more in their research paper or on the project website.

By the way, I used the following prompt to generate the images on Midjourney: “a female-looking beautiful humanoid robot dressed like a home assistant breaks out of a giant computer screen in an old-school castle (old money) environment smiling to a male coder sitting on a couch with his notebook on his knees” — could be improved I know. 😅