RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

Authors: Yao Mu, Junting Chen, Qing-Long Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one embodied navigation task.
Researcher Affiliation | Academia | 1 Department of Computer Science, The University of Hong Kong, Hong Kong; 2 OpenGVLab, Shanghai AI Laboratory; 3 ETH Zurich; 4 Shanghai Jiao Tong University; 5 The Chinese University of Hong Kong; 6 Tsinghua University; 7 UC Berkeley. Correspondence to: Ping Luo <pluo@cs.hku.hk>, Mingyu Ding <myding@berkeley.edu>.
Pseudocode | Yes | Listing 1. Pseudo-code of the Vision adapter
Open Source Code | No | The paper states that 'More demos and information can be found in our homepage' but does not provide a direct link to source code for the methodology, nor does it state that the code is publicly released or included in supplementary materials.
Open Datasets | Yes | We first randomly sample household scenes from the HM3D dataset (Ramakrishnan et al., 2021)... The objects are sampled from Google Scan Dataset (Downs et al., 2022), YCB Dataset (Calli et al., 2015), OmniObject3D Dataset (Wu et al., 2023b), and articulated object dataset AKB-48 (Liu et al., 2022). The general vision-language pre-training dataset we use contains the ShareGPT4V (Chen et al., 2023c) dataset, the SViT (Zhao et al., 2023) dataset, and the LLaVA Visual Instruct 150K dataset (Liu et al., 2023d).
Dataset Splits | No | The paper describes the datasets used for pre-training and supervised fine-tuning, including their mixing ratios, but does not provide explicit training/validation/test splits (percentages or counts) needed to reproduce the experimental setup.
Hardware Specification | No | The paper names the robot arms used for the real-world experiments (a Franka Emika arm and a UR5 arm) but gives no details about the computational hardware (e.g., GPU/CPU models, memory) used for training or inference of the models.
Software Dependencies | No | The paper mentions software components such as ROS, Gazebo, MoveIt, and OMPL, but does not give version numbers for these or other key dependencies (e.g., Python, deep learning frameworks) required for replication.
Experiment Setup | Yes | We commence pretraining the entire vision-language model on a pretraining dataset inclusive of general Visual Question Answering (VQA) data and generated multi-modal code generation data. Following this, we fine-tune the complete vision-language model using the supervised SFT dataset... During the pretraining stage, the RoboCodeX Pretrain Dataset was combined with other general VQA data (ShareGPT4V, SVIT, LLaVA-150K) in a 1:1 ratio. During the subsequent supervised fine-tuning (SFT) stage, the RoboCodeX SFT Dataset was combined with other general VQA data at a ratio of 10:1. (An illustrative sketch of this mixing scheme follows the table.)
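
The mixing ratios quoted in the Experiment Setup row are the clearest quantitative detail of the training recipe. The Python sketch below shows one common way such ratio-based dataset mixing can be implemented, assuming the ratios refer to example counts; the function mix_datasets and the dataset names (robocodex_pretrain, robocodex_sft, general_vqa) are hypothetical placeholders and are not taken from the paper.

    import random

    def mix_datasets(primary, general, ratio=(1, 1), seed=0):
        # Combine the full primary set with a sample of the general set so that
        # primary:general examples appear in roughly the requested ratio.
        # Illustrative sketch only, not the authors' pipeline.
        rng = random.Random(seed)
        primary_parts, general_parts = ratio
        n_general = int(len(primary) * general_parts / primary_parts)
        mixed = list(primary) + rng.sample(list(general), min(n_general, len(general)))
        rng.shuffle(mixed)
        return mixed

    # Pretraining stage: RoboCodeX pretrain data mixed with general VQA data at 1:1.
    # pretrain_mix = mix_datasets(robocodex_pretrain, general_vqa, ratio=(1, 1))

    # SFT stage: RoboCodeX SFT data mixed with general VQA data at 10:1.
    # sft_mix = mix_datasets(robocodex_sft, general_vqa, ratio=(10, 1))

The paper does not specify whether the ratios are applied by total example count or by per-batch sampling weight, so either interpretation would need to be confirmed before attempting a faithful reproduction.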