Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Authors: Jiachen Li, Qiaozi Gao, Michael Johnston, Xiaofeng Gao, Xuehai He, Hangjie Shi, Suhaila Shakiah, Reza Ghanadan, William Yang Wang

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we evaluate the efficacy of our method on the VIMA-BENCH (Jiang et al., 2023) and establish a new state-of-the-art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability. Project page: https://midas-icml.github.io/. We compare our method with various baselines from the VIMA paper (Jiang et al., 2023) on the VIMA-BENCH. All baseline methods only conduct multi-task imitation learning without pretraining. We conduct extensive experiments to study how our model design and training pipeline impact robot manipulation, focusing on the effectiveness of our pretraining strategy and prompt encoding. We also examine the impact of data scaling and model size. Appendix A presents individual task success rates for all methods and further ablates the decoder-only architecture of our model. Appendix E studies the effect of the number of gradient steps.
Researcher Affiliation | Collaboration | Work done during an internship at Amazon AGI. 1Department of Computer Science, University of California, Santa Barbara, USA 2Amazon AGI 3Department of Computer Science, University of California, Santa Cruz, USA. Correspondence to: Jiachen Li <jiachen li@ucsb.edu>.
Pseudocode | Yes | The pseudo-code (Algorithm 1) and detailed hyper-parameters (HP) are available in Appendix B.
Open Source Code | No | The paper provides a 'Project page: https://midas-icml.github.io/', which is typically a demonstration or overview page and is not stated to be a code repository. There is no explicit statement of code release for the method described in the paper.
Open Datasets | Yes | Empirically, we evaluate the efficacy of our method on the VIMA-BENCH (Jiang et al., 2023). VIMA-BENCH (Jiang et al., 2023) is built on top of the Ravens simulator (Zeng et al., 2021; Shridhar et al., 2023) and contains 17 types of tabletop manipulation tasks. Expert demonstrations are provided for 13 tasks as the training data, with 50K trajectories per task.
Dataset Splits | Yes | VIMA-BENCH establishes a four-level protocol to evaluate progressively stronger generalization, ranging from placement generalization (L1) and combinatorial generalization (L2) to novel object generalization (L3) and novel task generalization (L4). Expert demonstrations are provided for 13 tasks as the training data, with 50K trajectories per task. The other 4 tasks are included in the L4 task suite (see the evaluation-protocol sketch after this table).
Hardware Specification | Yes | We conduct our experiments on cluster nodes, each with 8 NVIDIA A10G GPUs.
Software Dependencies | No | The paper mentions software components such as a pretrained LM (T5-base) and the AdamW optimizer, but does not specify version numbers for general software dependencies such as Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | The pseudo-code (Algorithm 1) and detailed hyper-parameters (HP) are available in Appendix B. Table 19 presents the HP for our training pipeline, including Learning Rate (LR) 1e-4, Batch Size 128, Training Epochs, Warmup Steps, Dropout, and Optimizer AdamW (see the optimizer sketch after this table).
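
The four-level evaluation protocol quoted in the Dataset Splits row can be summarized in a short sketch. This is a minimal Python example, not VIMA-BENCH's actual API: `evaluate_policy(level)` is a hypothetical callable assumed to roll out a trained policy on one level's task suite and return its mean success rate.

```python
from typing import Callable, Dict

# The four VIMA-BENCH generalization levels described in the paper.
EVAL_LEVELS = {
    "L1": "placement generalization",
    "L2": "combinatorial generalization",
    "L3": "novel object generalization",
    "L4": "novel task generalization",  # the 4 held-out tasks belong to this suite
}

def evaluate_by_level(evaluate_policy: Callable[[str], float]) -> Dict[str, float]:
    """Evaluate a trained policy on each generalization level.

    `evaluate_policy` is a hypothetical stand-in (not part of any documented
    VIMA-BENCH interface shown above) that returns a success rate in [0, 1]
    for the given level.
    """
    results: Dict[str, float] = {}
    for level, description in EVAL_LEVELS.items():
        success_rate = evaluate_policy(level)
        print(f"{level} ({description}): {success_rate:.1%}")
        results[level] = success_rate
    return results
```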
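
The hyper-parameters quoted from Table 19 (LR 1e-4, Batch Size 128, AdamW, warmup) map onto a standard PyTorch optimizer setup. The sketch below assumes a linear warmup schedule and uses a placeholder warmup-step count, since the exact warmup, dropout, and epoch values are only given in the paper's Appendix B.

```python
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Values quoted in the paper's Table 19 (Appendix B).
LEARNING_RATE = 1e-4
BATCH_SIZE = 128  # intended for the DataLoader feeding the training loop

# Placeholder: the exact warmup-step count is only listed in Appendix B.
WARMUP_STEPS = 1000

def build_optimizer_and_scheduler(model: nn.Module):
    """AdamW with a linear warmup ramp, as a generic stand-in for the paper's schedule."""
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)

    def warmup(step: int) -> float:
        # Scale the LR linearly from ~0 up to LEARNING_RATE over WARMUP_STEPS updates, then hold.
        return min(1.0, (step + 1) / WARMUP_STEPS)

    scheduler = LambdaLR(optimizer, lr_lambda=warmup)
    return optimizer, scheduler

# Usage sketch: call scheduler.step() after each optimizer.step()
# while iterating over batches of size BATCH_SIZE.
```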