Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
VIMA: Robot Manipulation with Multimodal Prompts
Authors: Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, Linxi Fan
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to 2.9× task success rate given the same training data. With 10× less training data, VIMA still performs 2.7× better than the best competing variant. |
| Researcher Affiliation | Collaboration | 1Stanford University; 2Macalester College, now at Allen Institute for AI; 3NVIDIA; 4Caltech; 5Tsinghua; 6UT Austin. Work done during the first author's internship at NVIDIA. |
| Pseudocode | Yes | Pseudocode 1: Cross-attention operation that conditions the trajectory history on prompt. We repetitively alternate cross-attention and self-attention to model the trajectory given a specific task. |
| Open Source Code | Yes | Code and video demos are available at vimalabs.github.io. |
| Open Datasets | Yes | We open-source the simulation environment, training dataset, algorithm code, and pre-trained model checkpoints to ensure reproducibility and facilitate future work from the community. These materials along with video demos are available at vimalabs.github.io. |
| Dataset Splits | Yes | After training, we select model checkpoints for evaluation based on the aggregated accuracy on a held-out validation set. |
| Hardware Specification | Yes | All experiments are conducted on cluster nodes, each with 8 NVIDIA V100 GPUs. |
| Software Dependencies | Yes | We implement all models in PyTorch (Paszke et al., 2019) and adapt Transformer-related implementation from Wolf et al. (2019). |
| Experiment Setup | Yes | Training hyperparameters are provided in Table 7. |
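
The Pseudocode row above quotes a cross-attention operation that conditions the trajectory history on the prompt, alternated with self-attention. The following is a minimal sketch of that alternation, not the paper's Pseudocode 1 and not the released VIMA code. It assumes single-head attention with no learned projections, no causal mask, no layer normalization, and no feed-forward sublayers; the function names (`attend`, `residual_add`, `alternate_layers`) and the toy inputs are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def residual_add(xs, ys):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def alternate_layers(history, prompt, num_layers=2):
    """Alternate cross-attention (history -> prompt) with self-attention.

    Assumption: plain residual connections; a real model would also use
    learned projections and a causal mask in the self-attention step.
    """
    h = history
    for _ in range(num_layers):
        # Cross-attention: queries from the history, keys/values from the prompt.
        h = residual_add(h, attend(h, prompt, prompt))
        # Self-attention: history tokens attend to each other.
        h = residual_add(h, attend(h, h, h))
    return h


# Toy inputs: three history tokens and two prompt tokens, each of width 2.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt = [[0.5, 0.5], [1.0, -1.0]]
conditioned = alternate_layers(history, prompt)
```

The output keeps one vector per history token regardless of prompt length, since the prompt only ever supplies keys and values.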