VIMA: Robot Manipulation with Multimodal Prompts
Authors: Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to 2.9× task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the best competing variant. |
| Researcher Affiliation | Collaboration | 1Stanford University; 2Macalester College, now at Allen Institute for AI; 3NVIDIA; 4Caltech; 5Tsinghua; 6UT Austin. Work done during the first author's internship at NVIDIA. |
| Pseudocode | Yes | Pseudocode 1: Cross-attention operation that conditions the trajectory history on the prompt. We repetitively alternate cross-attention and self-attention to model the trajectory given a specific task. A minimal sketch of this alternating pattern is given below the table. |
| Open Source Code | Yes | Code and video demos are available at vimalabs.github.io. |
| Open Datasets | Yes | We open-source the simulation environment, training dataset, algorithm code, and pre-trained model checkpoints to ensure reproducibility and facilitate future work from the community. These materials along with video demos are available at vimalabs.github.io. |
| Dataset Splits | Yes | After training, we select model checkpoints for evaluation based on the aggregated accuracy on a held-out validation set. |
| Hardware Specification | Yes | All experiments are conducted on cluster nodes, each with 8 NVIDIA V100 GPUs. |
| Software Dependencies | Yes | We implement all models in PyTorch (Paszke et al., 2019) and adapt Transformer-related implementation from Wolf et al. (2019). |
| Experiment Setup | Yes | Training hyperparameters are provided in Table 7. |
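
The alternating conditioning recipe quoted in the Pseudocode row can be illustrated with a short PyTorch sketch. This is not the authors' implementation (that is released at vimalabs.github.io); the class name `XAttnBlock`, the layer count, and all dimensions below are illustrative assumptions, and only the overall pattern — cross-attend trajectory tokens to prompt tokens, then causally self-attend over the trajectory history, repeated across blocks — follows the paper's description.

```python
# Minimal sketch (not the authors' code) of the alternating
# cross-attention / self-attention recipe from Pseudocode 1.
# Names, layer count, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class XAttnBlock(nn.Module):
    """One block: cross-attend to the prompt, then self-attend over history."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, traj: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # Condition trajectory tokens on the multimodal prompt tokens.
        x = self.norm1(traj)
        traj = traj + self.cross_attn(x, prompt, prompt, need_weights=False)[0]
        # Causal self-attention over the (now prompt-conditioned) history.
        x = self.norm2(traj)
        L = traj.size(1)
        causal = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=traj.device), diagonal=1
        )
        traj = traj + self.self_attn(
            x, x, x, attn_mask=causal, need_weights=False
        )[0]
        return traj


# Usage: repeatedly alternate the two attention types across blocks.
blocks = nn.ModuleList(XAttnBlock() for _ in range(4))
prompt = torch.randn(2, 16, 256)  # encoded multimodal prompt tokens
traj = torch.randn(2, 10, 256)    # trajectory history tokens
for blk in blocks:
    traj = blk(traj, prompt)      # final traj feeds the action decoder
```

The design choice the paper highlights is that the prompt is attended to via cross-attention in every block rather than being concatenated once into the input sequence, which keeps trajectory self-attention causal and short while conditioning each layer on the task specification.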