Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Authors: Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, Tao Kong

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform extensive experiments on the challenging CALVIN benchmark and a real robot. On CALVIN benchmark, our method outperforms state-of-the-art baseline methods and improves the success rate from 88.9% to 94.9%.
Researcher Affiliation | Industry | Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, Tao Kong. ByteDance Research. {wuhongtao.123,kongtao}@bytedance.com
Pseudocode | No | The paper describes the model architecture and training process using text and diagrams, but it does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | Project page: https://GR1-Manipulation.github.io
Open Datasets | Yes | The data for the large-scale video generative pre-training are sourced from the recently proposed Ego4D dataset (Grauman et al., 2022), which contains massive-scale human-object interactions. We perform experiments on the challenging CALVIN benchmark (Mees et al., 2022c).
Dataset Splits | Yes | We perform experiments on two splits of data: ABCD→D and ABC→D. The training dataset contains over 20k expert trajectories paired with language instruction labels. To study data efficiency, we train on 10% of the full training dataset from the ABCD→D split. Specifically, we sample 66 trajectories for each of the 34 tasks, i.e., 2,244 trajectories, from the total 22,966 training trajectories. (A sampling sketch follows the table.)
Hardware Specification | No | In real robot experiments, we use a 7-DoF Kinova Gen2 robot mounted with a RealSense camera on its end-effector. A Kinect Azure camera is used to provide the static view of the scene. (This describes the real-robot hardware, but the paper does not specify the computational hardware, e.g., GPU models, CPU types, or memory, used to train the models.)
Software Dependencies | No | We apply dropout and use AdamW (Loshchilov & Hutter, 2017) with cosine learning rate decay (Loshchilov & Hutter, 2016) to optimize the network. (While the optimizer and components such as CLIP and MAE are mentioned, no version numbers for software libraries or frameworks, e.g., PyTorch, TensorFlow, Python, or CUDA, are provided.)
Experiment Setup | Yes | Hyperparameters for pre-training and finetuning on CALVIN data are shown in Table 3. Table 3 (pre-training / finetuning): batch size 1024 / 512; learning rate 3.6e-4 / 1e-3; dropout 0.1 / 0.1; optimizer AdamW / AdamW; learning rate schedule cosine decay / cosine decay; warmup epochs 5 / 1; training epochs 50 / 20. (A training-configuration sketch follows the table.)
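
For concreteness, here is a minimal sketch of how the Table 3 finetuning settings combine: AdamW at learning rate 1e-3 with cosine decay and a one-epoch warmup over 20 epochs. It assumes PyTorch, which the paper does not name as its framework; `model` and `steps_per_epoch` are placeholders, not values from the paper.

```python
import math
import torch

# Sketch of the Table 3 finetuning recipe (assumption: PyTorch).
model = torch.nn.Linear(512, 512)  # stand-in for the GR-1 network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

steps_per_epoch = 1000              # illustrative; taken from the dataloader in practice
warmup_steps = 1 * steps_per_epoch  # 1 warmup epoch (Table 3, finetuning)
total_steps = 20 * steps_per_epoch  # 20 training epochs

def lr_lambda(step):
    # Linear warmup for the first epoch, then cosine decay to zero.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: call optimizer.step(), then scheduler.step(), once per batch.
```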
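The 10% data-efficiency subset from the Dataset Splits row can be sketched the same way: 66 trajectories drawn per task across 34 tasks (34 × 66 = 2,244) from the 22,966 ABCD→D training trajectories. The function and data layout below are a hypothetical reconstruction, not the authors' code.

```python
import random
from collections import defaultdict

def subsample_per_task(trajectories, per_task=66, seed=0):
    """Draw `per_task` trajectories for each task.

    `trajectories` is assumed to be a list of (task_name, trajectory) pairs.
    """
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for task, traj in trajectories:
        by_task[task].append(traj)
    subset = []
    for task in sorted(by_task):  # sorted for deterministic ordering
        subset.extend(rng.sample(by_task[task], per_task))
    return subset  # 2,244 trajectories when there are 34 tasks
```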