3D-VLA: A 3D Vision-Language-Action Generative World Model
Authors: Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications. 3D-VLA is a versatile 3D-based generative world model that can perform reasoning and grounding in the 3D world, imagine multi-modal goal content, and generate actions for robot manipulation. In this section, we evaluate 3D-VLA in three aspects: 3D reasoning and localization, multi-modal goal generation, and embodied action planning. |
| Researcher Affiliation | Collaboration | (1) University of Massachusetts Amherst, (2) Shanghai Jiao Tong University, (3) South China University of Technology, (4) Wuhan University, (5) Massachusetts Institute of Technology, (6) University of California, Los Angeles, (7) MIT-IBM Watson AI Lab. Correspondence to: Chuang Gan <ganchuang1990@gmail.com>. |
| Pseudocode | No | The paper describes its methodology in narrative text and does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website link (https://vis-www.cs.umass.edu/3dvla) but does not explicitly state that the source code for the methodology described in the paper is released or available there, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We select 12 datasets (Brohan et al., 2022; Jang et al., 2022; Walke et al., 2023; Lynch et al., 2023; Feng et al., 2023; Chen et al., 2023a; Dass et al., 2023; Mandlekar et al., 2019; Mees et al., 2023; Shah et al., 2023; Sawhney et al., 2021; Sermanet et al., 2023) from the Open-X Embodiment Dataset (Padalkar et al., 2023). They have high-quality images with linguistic instructions in the real world but lack more in-depth information and 3D annotations. We also select datasets with excellent depth information, such as Dobb-E (Shafiullah et al., 2023) and RH20T (Fang et al., 2023). Additionally, we use datasets collected from two simulator environments, RLBench (James et al., 2020) and CALVIN (Mees et al., 2022). We utilize several human-object interaction datasets, including datasets without depth information, such as Epic-Kitchens (Damen et al., 2018), and datasets with better 3D annotations, such as HOI4D (Liu et al., 2022). |
| Dataset Splits | No | The paper mentions using 'held-in datasets' for evaluation and sampling from a 'test set', but does not provide specific percentages or absolute counts for training, validation, and test splits needed for reproduction, nor does it detail a specific splitting methodology beyond the use of existing dataset splits. |
| Hardware Specification | Yes | We train our RGB-D editing diffusion model on 6×16 V100 GPUs. We train 3D-VLAs on 6×32 V100s. We train 3D-VLAs for a maximum of 30 epochs on 6×64 V100s. |
| Software Dependencies | Yes | We utilize BLIP-2 FlanT5XL (Li et al., 2023b) as our pretrained model. For RGBD to RGBD generation, we employ Stable Diffusion V1.4 (Rombach et al., 2022) as our pretrained model... For point-to-point generation, we use Point-E (Nichol et al., 2022) as the pretrained model... We utilize LoRA (Hu et al., 2021) to fine-tune different diffusion models. We use spaCy (Honnibal & Montani, 2017) to parse the instructions... We utilize a pre-trained grounding model (e.g., Grounded-SAM (Ren et al., 2024))... The ChatGPT version used in our paper is GPT-3.5-turbo-0125. (An illustrative spaCy parsing sketch follows the table.) |
| Experiment Setup | Yes | We train at 256×256 resolution with batch size of 32 on each GPU. We use a learning rate of 10⁻⁴. The batch size is set to 4 on each node during training. Additionally, we apply a linear warmup of the learning rate during the initial 1K steps, increasing from 10⁻⁸ to 10⁻⁵, followed by a cosine decay with a minimum learning rate of 10⁻⁶. The AdamW optimizer is used, with beta1 = 0.9, beta2 = 0.999, and a weight decay of 0.05. (A hedged optimizer/scheduler sketch follows the table.) |
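The Software Dependencies row notes that instructions are parsed with spaCy before a grounding model (e.g., Grounded-SAM) localizes objects. Below is a minimal sketch, assuming spaCy's small English pipeline, of how an instruction could be parsed into candidate object phrases; `extract_object_phrases` is a hypothetical helper, not the paper's released code.

```python
# Illustrative only: parse a language instruction with spaCy and keep the
# noun chunks as candidate object phrases for a downstream grounding model.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English pipeline

def extract_object_phrases(instruction: str) -> list[str]:
    """Return noun chunks that may name objects to localize in the scene."""
    doc = nlp(instruction)
    return [chunk.text for chunk in doc.noun_chunks]

print(extract_object_phrases("pick up the red mug next to the laptop"))
# -> ['the red mug', 'the laptop']
```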
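The Experiment Setup row quotes the warmup/decay schedule and AdamW hyperparameters. The following is a minimal sketch, assuming PyTorch, of that configuration: linear warmup from 10⁻⁸ to 10⁻⁵ over the first 1K steps, cosine decay to a 10⁻⁶ floor, betas (0.9, 0.999), and weight decay 0.05. The model and total step count are placeholders, not values from the paper.

```python
# Sketch of the quoted AdamW + linear-warmup + cosine-decay settings.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 8)               # placeholder for the 3D-VLA backbone
warmup_steps, total_steps = 1_000, 100_000  # total_steps is an assumption

peak_lr, warmup_start_lr, min_lr = 1e-5, 1e-8, 1e-6
optimizer = AdamW(model.parameters(), lr=peak_lr,
                  betas=(0.9, 0.999), weight_decay=0.05)

def lr_lambda(step: int) -> float:
    # LambdaLR scales the base lr (peak_lr) by the factor returned here.
    if step < warmup_steps:
        lr = warmup_start_lr + (peak_lr - warmup_start_lr) * step / warmup_steps
    else:
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        lr = min_lr + (peak_lr - min_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr / peak_lr

scheduler = LambdaLR(optimizer, lr_lambda)

# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```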