Research
Published July 28, 2023
Yevgen Chebotar, Tianhe Yu
Robotic Transformer 2 (RT-2) is a new Vision Language Action (VLA) model that learns from both web and robot data and converts this knowledge into generalized instructions for robot control.
Because high-capacity vision-language models (VLMs) are trained on web-scale datasets, these systems are remarkably good at recognizing visual and linguistic patterns and operating across different languages. But for robots to achieve a similar level of competency, they would need robot data collected first-hand across every object, environment, task, and situation.
In our paper, we introduce Robotic Transformer 2 (RT-2), a novel vision language action (VLA) model that learns from both web and robotics data and translates this knowledge into generalized instructions for robot control, while retaining web-scale capabilities.
A visual language model (VLM) pre-trained on web-scale data learns from RT-1 robotics data to become RT-2, a visual language action (VLA) model that can control a robot.
This work builds on Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations that can learn combinations of tasks and objects seen in the robot data. Specifically, our work uses RT-1 robot demonstration data collected with 13 robots over 17 months in an office kitchen environment.
RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robot data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.
We also show that RT-2 can perform multi-step semantic reasoning by incorporating chain-of-thought reasoning, for example deciding which object could be used as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
Adapting VLM to robot control
RT-2 is built on a VLM that takes one or more images as input and produces a sequence of tokens that conventionally represent natural language text. Such VLMs have been successfully trained on web-scale data to perform tasks such as visual question answering, image captioning, and object recognition. In our work, we adapt the Pathways Language and Image Model (PaLI-X) and the Pathways Language Model Embodied (PaLM-E) to serve as the backbone of RT-2.
To control a robot, the model must be trained to output actions. We address this challenge by representing actions as tokens in the model’s output, similar to language tokens, and by describing actions as strings that can be processed by standard natural language tokenizers, as shown below.
Representation of action strings used in RT-2 training. An example of such a string is a series of robot action token numbers, such as “1 128 91 241 5 101 127 217.”
The string starts with a flag indicating whether to continue or terminate the current episode without executing the subsequent commands, followed by commands to change the position and rotation of the end effector, as well as the desired extension of the robot gripper.
With this representation, we show that VLM models can be trained on robot data by using a discretized version of the same robot actions as RT-1, converted to a string representation; the input and output spaces of such models do not need to be changed.
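To make the representation concrete, here is a minimal sketch of how a continuous end-effector action could be discretized and rendered as such a token string, and decoded back for execution. The bin count of 256 follows RT-1's discretization; the action bounds and helper names are illustrative assumptions, not details from the post.

```python
import numpy as np

# Minimal sketch of the action-to-string representation described above.
# Assumptions: 256 uniform bins per dimension (as in RT-1's discretization)
# and illustrative action bounds; helper names are hypothetical.

N_BINS = 256

# Illustrative bounds for [dx, dy, dz, droll, dpitch, dyaw, gripper].
ACTION_LOW = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
ACTION_HIGH = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])


def action_to_string(terminate: bool, action: np.ndarray) -> str:
    """Discretize a continuous end-effector action into 256 bins and render
    it as a space-separated token string (terminate flag first)."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    fraction = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    bins = np.round(fraction * (N_BINS - 1)).astype(int)
    return " ".join(str(t) for t in [int(terminate)] + bins.tolist())


def string_to_action(tokens: str):
    """Invert the mapping: parse a predicted token string back into a
    terminate flag and a continuous action the robot can execute."""
    values = [int(t) for t in tokens.split()]
    terminate, bins = bool(values[0]), np.array(values[1:], dtype=float)
    action = ACTION_LOW + bins / (N_BINS - 1) * (ACTION_HIGH - ACTION_LOW)
    return terminate, action


# Produces an 8-token string in the same format as "1 128 91 241 5 101 127 217".
print(action_to_string(True, np.array([0.0, -0.03, 0.09, -0.48, -0.1, 0.0, 0.7])))
```

Because both the instruction and the target are plain strings, the VLM's existing tokenizer and output head can be reused unchanged, which is the point made above about not modifying the model's input and output spaces.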
RT-2 architecture and training: we jointly fine-tune a pre-trained VLM model on robotics and web data. The resulting model takes in robot camera images and directly predicts the actions for the robot to perform.
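The joint fine-tuning described in the caption can be thought of as ordinary next-token-prediction fine-tuning over a mixture of the two data sources. The sketch below shows one way such a mixture could be sampled; the helper name, the example-triple structure, and the 50/50 ratio are illustrative assumptions rather than details from the post.

```python
import random

# Minimal sketch of mixing web and robot data for joint fine-tuning.
# Assumes each example is an (image, input_text, target_text) triple:
# web targets are natural language, robot targets are action-token strings.
# The sampling ratio and function name are illustrative assumptions.

def sample_cofinetuning_batch(web_examples, robot_examples,
                              batch_size=8, robot_fraction=0.5):
    """Draw a batch that mixes web vision-language examples with robot
    episodes so a single VLM is fine-tuned on both with one tokenizer."""
    batch = []
    for _ in range(batch_size):
        pool = robot_examples if random.random() < robot_fraction else web_examples
        batch.append(random.choice(pool))
    return batch
```

Because robot examples differ from web examples only in their target text, the same language-modeling loss covers both, which is what lets the fine-tuned model emit action tokens directly from camera images and an instruction.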
Generalization and emergent skills
We performed a series of qualitative and quantitative experiments on our RT-2 models, with over 6,000 robot trials. In exploring RT-2's emergent capabilities, we first searched for tasks that would require combining knowledge from web-scale data with the robot's experience, and then defined three categories of skills: symbol understanding, reasoning, and human recognition.
Each task required understanding visual-semantic concepts and the ability to perform robot control that operates on these concepts. Commands such as “pick up the bag about to fall off the table” or “move the banana to the sum of two plus one” – where the robot is asked to perform a manipulation task on objects or scenarios never seen in the robot data – required knowledge translated from web-based data to operate.
Examples of new robot skills that are not present in the robot data and require knowledge transfer from web pre-training.
Across all categories, we observed improved generalization performance, a more than 3x improvement compared to previous baselines.
Emergent skill evaluation success rates: our RT-2 models outperform both the previous Robotic Transformer (RT-1) and visual pre-training (VC-1) baselines.
We also performed a series of quantitative evaluations, beginning with the original RT-1 tasks, for which we have examples in the robot data, and continuing with varying degrees of previously unseen objects, backgrounds, and environments, which required the robot to learn generalization from VLM pre-training.
Examples of environments previously unseen by the robot, to which RT-2 generalizes.
RT-2 retained performance on the original tasks seen in the robot data and improved performance on previously unseen scenarios from RT-1's 32% to 62%, showing the considerable benefit of large-scale pre-training.
Additionally, we observed significant improvements over baselines pre-trained on vision-only tasks, such as VC-1 and Reusable Representations for Robotic Manipulation (R3M), and over algorithms that use VLMs for object identification, such as Manipulation of Open-World Objects (MOO).
RT-2 achieves high performance on seen, in-distribution tasks and outperforms multiple baselines on unseen, out-of-distribution tasks.
Evaluating our model on the open-source Language Table suite of robot tasks, we achieved a 90% success rate in simulation, significantly improving over previous baselines including BC-Z (72%), RT-1 (74%), and LAVA (77%).
We then evaluated the same model in the real world (since it was trained on simulation and real data) and demonstrated its ability to generalize to novel objects, as shown below, where none of the objects except the blue cube were present in the training dataset.
RT-2 performs well on real-robot Language Table tasks. None of the objects except the blue cube were present in the training data.
Incorporating chain-of-thought reasoning
Inspired by chain-of-thought prompting methods used in LLMs, we probed our models to combine robot control with chain-of-thought reasoning, enabling the learning of long-horizon planning and low-level skills within a single model.
In particular, we fine-tuned a variant of RT-2 for just a few hundred gradient steps to increase its ability to use language and actions jointly. We then augmented the data to include an additional “plan” step, which first describes in natural language the purpose of the action the robot is about to take, followed by “action” and the action tokens. Here we show an example of such reasoning and the resulting robot behavior.
Chain of thought reasoning enables the learning of self-contained models that can both plan long-term skill sequences and predict robot behavior.
This process allows RT-2 to execute more complex commands that require reasoning about the intermediate steps needed to accomplish a user's instruction. Thanks to its VLM backbone, RT-2 can also plan from both image and text commands, enabling visually grounded planning, whereas current plan-and-act approaches like SayCan cannot see the real world and rely entirely on language.
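To make the augmentation concrete, here is a minimal sketch of how such a "plan + action" training target could be assembled, reusing the action-token string representation described earlier. The "Plan:"/"Action:" labels and the helper name are illustrative assumptions, not the authors' exact format.

```python
# Minimal sketch of the chain-of-thought data augmentation described above.
# The "Plan:"/"Action:" labels and the helper name are illustrative
# assumptions, not the authors' exact format.

def make_cot_target(plan: str, action_tokens: str) -> str:
    """Prefix the action-token string with a natural-language plan so the
    model learns to describe its intent before emitting the action."""
    return f"Plan: {plan} Action: {action_tokens}"


# Example target for an instruction like
# "pick up the object that could be used as an improvised hammer":
print(make_cot_target("pick up the rock.", "1 128 91 241 5 101 127 217"))
```

Because the plan is just additional text in the same output stream, the model can be fine-tuned on these augmented targets with the same next-token-prediction objective used for the rest of the data.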
Evolution of robot control
RT-2 shows that vision language models (VLMs) can be transformed into powerful vision language action (VLA) models, which can directly control a robot by combining VLM pre-training with robot data.
With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly improved robot policies and, more importantly, leads to significantly better generalization performance and emergent capabilities, inherited from web-scale vision-language pre-training.
RT-2 is not only a simple and effective modification of existing VLM models, but also shows the promise of building a general-purpose physical robot that can reason, problem-solve, and interpret information to perform a diverse range of tasks in the real world.
Acknowledgment
We would like to thank the co-authors of this work: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich, and to thank Fred Alcober, Jodi Lynn Andres, Carolina Parada, Joseph Dabis, Rochelle Dela Cruz, Jessica Gomez, Gavin Gonzalez, John Guilyard, Tomas Jackson, Jie Tan, Scott Lehrer, Dee M, Utsav Malla, Sarah Nguyen, Jane Park, Emily Perez, Elio Prado, Jornell Quiambao, Clayton Tan, Jodexty Therlonge, Eleanor Tomlinson, Wenxuan Zhou, and the broader Google DeepMind team for their help and feedback.