
Title:

RT-2: Vision-Language-Action Models
Transfer Web Knowledge to Robotic Control

Publish Date: 

21 August 2024

Abstract:

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to this category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability…

Download Paper

Tags:

Robotics, VLM

Reviewer:

Dennis Kuriakose

Date:

7 August 2024

Review:


Insight: The paper introduces RT-2 (Robotics Transformer 2), a model that integrates vision-language models (VLMs) trained on internet-scale data directly into robotic control systems. This integration allows robots to benefit from the generalization and semantic reasoning capabilities of VLMs, improving their ability to perform tasks in real-world environments. The key innovation is the use of vision-language-action (VLA) models, which express robotic actions as text tokens, enabling the models to be trained using the same methodology as VLMs.
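As a concrete illustration of the action-as-text idea, the sketch below discretizes a continuous end-effector action into integer bins and renders the bin indices as a token string, which is roughly how an RT-2-style VLA folds actions into the language vocabulary. The 7-DoF layout, action bounds, bin count, and string format here are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

NUM_BINS = 256  # assumed bin count per action dimension

def tokenize_action(action, low, high):
    """Map a continuous action vector to a whitespace-separated string of bin indices."""
    action = np.clip(action, low, high)
    bins = np.round((action - low) / (high - low) * (NUM_BINS - 1)).astype(int)
    return " ".join(str(b) for b in bins)

def detokenize_action(token_str, low, high):
    """Invert the mapping: token string -> approximate continuous action."""
    bins = np.array([int(t) for t in token_str.split()])
    return low + bins / (NUM_BINS - 1) * (high - low)

# Hypothetical 7-DoF action: xyz translation delta, rotation delta, gripper command.
low  = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
high = np.array([ 0.05,  0.05,  0.05,  0.25,  0.25,  0.25, 1.0])
action = np.array([0.01, -0.02, 0.0, 0.10, 0.0, -0.05, 1.0])

tokens = tokenize_action(action, low, high)
print(tokens)                                # a string of 7 integer tokens
print(detokenize_action(tokens, low, high))  # recovers the action up to binning error
```

Because the action is now plain text, the VLM's existing tokenizer, decoder, and training loss can be reused without architectural changes; only the interpretation of the output tokens changes at deployment time.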

Training Approach: RT-2 is developed by co-fine-tuning large VLMs with both internet-scale vision-language tasks (e.g., visual question answering) and robotic trajectory data. The actions are tokenized into text and included in the training data alongside natural language, allowing the model to learn robotic policies directly. This approach leverages the extensive pretraining of VLMs, making the model capable of mapping robot observations to actions effectively.
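At the data level, co-fine-tuning can be pictured as interleaving web vision-language examples with robot trajectory steps that have been reduced to the same (image, prompt, target-text) shape. The record fields, example strings, and 50/50 sampling ratio below are assumptions made for illustration, not the paper's actual mixture.

```python
import random

# Hypothetical web VQA example: ordinary image + question + text answer.
web_vqa = [
    {"image": "kitchen.jpg",
     "prompt": "Q: What fruit is on the counter? A:",
     "target": "a banana"},
]

# Hypothetical robot trajectory step: camera frame + instruction + action-as-text target.
robot_steps = [
    {"image": "frame_0041.jpg",
     "prompt": "Instruction: pick up the banana. Action:",
     "target": "153 76 128 178 128 102 255"},
]

def cofinetune_stream(web, robot, robot_fraction=0.5, steps=4):
    """Yield a mixed stream of training examples; since robot actions are already
    text, the same next-token prediction loss applies to both sources."""
    for _ in range(steps):
        source = robot if random.random() < robot_fraction else web
        yield random.choice(source)

for example in cofinetune_stream(web_vqa, robot_steps):
    print(example["prompt"], "->", example["target"])
```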

Results: The RT-2 model demonstrated significant improvements in generalization, being able to handle novel objects and instructions that were not part of the robotic training data. It also exhibited emergent capabilities, such as performing multi-stage reasoning and interpreting semantic cues (e.g., placing objects based on icons or reasoning about the best object to use as a tool). Over 6,000 evaluation trials showed that RT-2 outperformed previous models in terms of both generalization and the range of tasks it could handle.

Limitations and Future Directions:

The RT-2 model, despite its advanced capabilities, has several limitations:

  1. Physical Skill Limitations:

    • No New Motions: Web-scale VLM pretraining does not teach the model new physical skills; it remains restricted to the motions demonstrated in the robot training data, so its physical repertoire is bounded by the diversity of that dataset.

  2. Computation Costs:

    • High Real-Time Computation Costs: Running large vision-language-action (VLA) models such as RT-2 in real time is computationally expensive, which becomes a bottleneck in applications that require high-frequency control (see the sketch below).
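To make the bottleneck concrete, here is a back-of-envelope sketch of how inference latency caps the closed-loop control rate when every action requires a full forward pass of the model. The latency figures are hypothetical assumptions, not measurements reported in the paper.

```python
def max_control_hz(inference_latency_s: float) -> float:
    """Upper bound on closed-loop control frequency if each action
    requires one full VLA forward pass."""
    return 1.0 / inference_latency_s

# Hypothetical per-step inference latencies, for illustration only.
for name, latency_s in [("smaller VLA backbone", 0.10),
                        ("larger VLA backbone", 0.50)]:
    print(f"{name}: at most {max_control_hz(latency_s):.1f} Hz control loop")
```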
