VIMA: Robot Manipulation with Multimodal Prompts

Authors: Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to 2.9x task success rate given the same training data. With 10x less training data, VIMA still performs 2.7x better than the best competing variant.
Researcher Affiliation Collaboration 1Stanford University; 2Macalester College, now at Allen Institute for AI; 3NVIDIA; 4Caltech; 5Tsinghua; 6UT Austin. Work done during the first author's internship at NVIDIA.
Pseudocode Yes Pseudocode 1: Cross-attention operation that conditions the trajectory history on the prompt. We repeatedly alternate cross-attention and self-attention to model the trajectory given a specific task. (An illustrative sketch of this alternating pattern is given below the table.)
Open Source Code Yes Code and video demos are available at vimalabs.github.io.
Open Datasets Yes We open-source the simulation environment, training dataset, algorithm code, and pre-trained model checkpoints to ensure reproducibility and facilitate future work from the community. These materials along with video demos are available at vimalabs.github.io.
Dataset Splits Yes After training, we select model checkpoints for evaluation based on the aggregated accuracy on a held-out validation set.
Hardware Specification Yes All experiments are conducted on cluster nodes, each with 8 NVIDIA V100 GPUs.
Software Dependencies Yes We implement all models in PyTorch (Paszke et al., 2019) and adapt Transformer-related implementation from Wolf et al. (2019).
Experiment Setup Yes Training hyperparameters are provided in Table 7.
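The alternating cross-attention/self-attention recipe quoted in the Pseudocode row can be sketched as follows. This is a minimal illustrative PyTorch sketch, not the released VIMA code: the class names (XAttnGPTBlock, XAttnGPTDecoder), layer count, and dimensions are assumptions; only the overall pattern, trajectory tokens cross-attending to the encoded prompt and then causally self-attending over the history, follows the paper's description.

```python
# Illustrative sketch (assumed names and sizes), not the official VIMA implementation.
import torch
import torch.nn as nn


class XAttnGPTBlock(nn.Module):
    """One alternating block: cross-attention to the prompt, then causal self-attention."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, traj: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # Condition the trajectory tokens on the encoded multimodal prompt.
        x, _ = self.cross_attn(query=self.norm1(traj), key=prompt, value=prompt)
        traj = traj + x
        # Causal self-attention over the trajectory history.
        T = traj.size(1)
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=traj.device), diagonal=1
        )
        h = self.norm2(traj)
        x, _ = self.self_attn(h, h, h, attn_mask=causal)
        return traj + x


class XAttnGPTDecoder(nn.Module):
    """Stack of alternating cross-/self-attention blocks over the trajectory."""

    def __init__(self, n_layers: int = 4, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            XAttnGPTBlock(d_model, n_heads) for _ in range(n_layers)
        )

    def forward(self, traj: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            traj = block(traj, prompt)
        return traj  # fed to an action head to predict the next motor action


# Usage: prompt tokens would come from a prompt encoder; trajectory tokens
# interleave observation and action embeddings (shapes here are placeholders).
if __name__ == "__main__":
    prompt_tokens = torch.randn(2, 20, 512)  # (batch, prompt length, d_model)
    traj_tokens = torch.randn(2, 12, 512)    # (batch, history length, d_model)
    decoder = XAttnGPTDecoder()
    out = decoder(traj_tokens, prompt_tokens)
    print(out.shape)  # torch.Size([2, 12, 512])
```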