Vision-Language Foundation Models as Effective Robot Imitators

Authors: Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, Hang Li, Tao Kong

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Recent progress in vision-language foundation models has shown their ability to understand multimodal data and resolve complicated vision-language tasks, including robotic manipulation. We seek a way of making use of existing vision-language models (VLMs) with fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is only slightly fine-tuned by imitation learning on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By surpassing state-of-the-art performance on the tested benchmark by a significant margin, we demonstrate that RoboFlamingo is an effective and competitive alternative for adapting VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. RoboFlamingo can be trained or evaluated on a single GPU server, and we believe it has the potential to be a cost-effective and easy-to-use solution for robot manipulation, empowering everyone with the ability to fine-tune their own robotics policy. Codes and models will be public.
Researcher Affiliation | Collaboration | Xinghang Li1,2, Minghuan Liu2,3, Hanbo Zhang2, Cunjun Yu4, Jie Xu2, Hongtao Wu2, Chilam Cheang2, Ya Jing2, Weinan Zhang3, Huaping Liu1, Hang Li2, Tao Kong2. 1Tsinghua University, 2ByteDance Research, 3Shanghai Jiao Tong University, 4National University of Singapore. lixingha23@mails.tsinghua.edu.cn, hpliu@tsinghua.edu.cn, {minghuanliu, wnzhang}@sjtu.edu.cn, kongtao@bytedance.com
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. Figures 2 and 11 illustrate the framework and policy heads, but they are diagrams, not structured pseudocode.
Open Source Code | No | Codes and models will be public.
Open Datasets | Yes | We choose CALVIN (Mees et al., 2022b), an open-source simulated benchmark to learn long-horizon language-conditioned tasks, as our testbed, and the corresponding datasets as our imitation learning demonstration data. CALVIN encompasses a total of 34 distinct tasks and evaluates 1000 unique instruction chains for sequential tasks. In each experiment, the robot is required to successfully complete sequences of up to five language instructions consecutively. ... We also conduct an ablation study in which we freeze the pre-trained VLM and train only the policy head (denoted as No VL Finetune). As shown in Fig. 3 (b), we can see that vision-language pre-training crucially improves downstream robotic manipulation by a large margin. Besides, tuning the VL model itself on robotic tasks is indispensable due to the limited capacity of the policy head. ... We conduct further experiments by testing the fine-tuned RoboFlamingo model (the M-3B-IFT variant) on the COCO image captioning and VQAv2 benchmarks, which verify our conjecture (see Tab. 6).
Dataset Splits | Yes | We choose CALVIN (Mees et al., 2022b), an open-source simulated benchmark to learn long-horizon language-conditioned tasks, as our testbed, and the corresponding datasets as our imitation learning demonstration data. CALVIN encompasses a total of 34 distinct tasks and evaluates 1000 unique instruction chains for sequential tasks. In each experiment, the robot is required to successfully complete sequences of up to five language instructions consecutively. The policy for each consecutive task is conditioned on a goal instruction, and the agent advances to the subsequent goal only if it successfully accomplishes the current task. The dataset contains four splits for environments A, B, C, and D. Each consists of 6 hours of human-teleoperated recording data (more than 2 million steps) that might contain sub-optimal behavior, and only 1% of that data is annotated with language instructions (~24 thousand steps). See Fig. 4 in Appendix A.1 for a more detailed description and visualized examples of the benchmark. We train RoboFlamingo (with the M-3B-IFT backbone) using demonstrations only with language annotations from all 4 splits (A, B, C, and D), and evaluate the imitation performance on episodes sampled from split D (ABCD → D). For vision generalization, we train models on splits A, B, and C and test on split D, which presents a different vision context.
Hardware Specification | Yes | All experiments involved in this paper are conducted on a single GPU server with 8 NVIDIA Tesla A100 GPUs, and the default batch size is 6 on each GPU.
Software Dependencies | No | The paper mentions using OpenFlamingo, LLaMA, GPT-NeoX, and MPT models, as well as GPT-4 for instruction enrichment. However, it does not specify explicit software dependencies with version numbers (e.g., PyTorch version, Python version, specific library versions) needed to reproduce the experimental environment.
Experiment Setup | Yes | In the training procedure, we follow the fine-tuning paradigm of OpenFlamingo by training only the parameters of the resampler, the gated cross-attention module of each decoder layer, and the policy head, while freezing all other parameters. All experiments involved in this paper are conducted on a single GPU server with 8 NVIDIA Tesla A100 GPUs, and the default batch size is 6 on each GPU. The MPT-3B model takes 13 hours of training per epoch and achieves its best performance at the 3rd epoch, while the MPT-9B model takes 26 hours of training per epoch and achieves its best performance at the 4th epoch.
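
The Research Type row above quotes the paper's decomposition: a pre-trained VLM handles single-step vision-language comprehension, while an explicit policy head models the sequential history and outputs actions. Below is a minimal PyTorch sketch of that decomposition, not the authors' implementation; the module names, feature sizes, and the 6-DoF-arm-plus-gripper action format are illustrative assumptions.

import torch
import torch.nn as nn

class RecurrentPolicyHead(nn.Module):
    # LSTM policy head over per-step vision-language features (sizes are illustrative).
    def __init__(self, feat_dim=1024, hidden_dim=512, arm_dim=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.arm_head = nn.Linear(hidden_dim, arm_dim)   # relative end-effector pose
        self.gripper_head = nn.Linear(hidden_dim, 1)     # open/close logit

    def forward(self, vl_features, state=None):
        # vl_features: (batch, time, feat_dim), one fused vision-language feature per
        # control step, produced by the (mostly frozen) VLM from the image and instruction.
        out, state = self.lstm(vl_features, state)
        return self.arm_head(out), torch.sigmoid(self.gripper_head(out)), state

# Example with stand-in features in place of the VLM output:
# head = RecurrentPolicyHead()
# arm, gripper, _ = head(torch.randn(2, 12, 1024))   # -> (2, 12, 6) and (2, 12, 1)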
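The Open Datasets and Dataset Splits rows describe CALVIN's sequential evaluation: 1000 instruction chains of up to five language instructions each, where the agent advances to the next instruction only after completing the current one. A hedged sketch of that scoring loop follows; rollout is a hypothetical stand-in for executing the policy in the benchmark environment, not a CALVIN API.

from collections import Counter

def evaluate_chains(chains, rollout):
    # chains: list of instruction chains (each a list of up to five instructions).
    # rollout(instruction) -> bool, True if the policy completes the subtask.
    completed = Counter()
    for chain in chains:
        for i, instruction in enumerate(chain):
            if not rollout(instruction):
                break
            completed[i + 1] += 1  # chains that solved at least i+1 consecutive subtasks
    n = len(chains)
    success_rates = {k: completed[k] / n for k in range(1, 6)}   # SR for 1..5 in a row
    avg_seq_len = sum(completed.values()) / n                    # average completed length
    return success_rates, avg_seq_len

# Example with a dummy policy that succeeds 70% of the time:
# import random
# chains = [["task"] * 5 for _ in range(1000)]
# print(evaluate_chains(chains, lambda _: random.random() < 0.7))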
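The Experiment Setup row states that only the resampler, the gated cross-attention modules, and the policy head are trained while all other parameters stay frozen. The sketch below shows one way to express such selective fine-tuning in PyTorch; the parameter-name substrings are assumptions for illustration, not the exact OpenFlamingo module names.

import torch

# Assumed name fragments for the trainable modules; adjust to the real model.
TRAINABLE_KEYS = ("perceiver", "gated_cross_attn", "policy_head")

def mark_trainable(model: torch.nn.Module) -> None:
    # Freeze everything, then unfreeze only the resampler (perceiver),
    # the gated cross-attention layers, and the policy head.
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in TRAINABLE_KEYS)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable:,} / {total:,}")

# The optimizer then receives only the unfrozen parameters, e.g.
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)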