3D-VLA: A 3D Vision-Language-Action Generative World Model

Authors: Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, Chuang Gan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on held-in datasets demonstrate that 3D-VLA significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications. From Section 5 (Experiments): 3D-VLA is a versatile 3D-based generative world model that can perform reasoning and grounding in the 3D world, imagine multi-modal goal content, and generate actions for robot manipulation. In this section, we evaluate 3D-VLA in three aspects: 3D reasoning and localization, multi-modal goal generation, and embodied action planning.
Researcher Affiliation | Collaboration | 1 University of Massachusetts Amherst; 2 Shanghai Jiao Tong University; 3 South China University of Technology; 4 Wuhan University; 5 Massachusetts Institute of Technology; 6 University of California, Los Angeles; 7 MIT-IBM Watson AI Lab. Correspondence to: Chuang Gan <ganchuang1990@gmail.com>.
Pseudocode | No | The paper describes its methodology in narrative text and does not include any explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a project website link (https://vis-www.cs.umass.edu/3dvla) but does not explicitly state that the source code for the described methodology is released or available there, nor does it provide a direct link to a code repository.
Open Datasets | Yes | We select 12 datasets (Brohan et al., 2022; Jang et al., 2022; Walke et al., 2023; Lynch et al., 2023; Feng et al., 2023; Chen et al., 2023a; Dass et al., 2023; Mandlekar et al., 2019; Mees et al., 2023; Shah et al., 2023; Sawhney et al., 2021; Sermanet et al., 2023) from the Open-X Embodiment Dataset (Padalkar et al., 2023). They have high-quality images with linguistic instructions in the real world but lack more in-depth information and 3D annotations. We also select datasets with excellent depth information, such as Dobb-E (Shafiullah et al., 2023) and RH20T (Fang et al., 2023). Additionally, we use datasets collected from two simulator environments, RLBench (James et al., 2020) and CALVIN (Mees et al., 2022). We utilize several human-object interaction datasets, including datasets without depth information, such as Epic-Kitchens (Damen et al., 2018), and datasets with better 3D annotations, such as HOI4D (Liu et al., 2022).
Dataset Splits | No | The paper mentions evaluating on "held-in datasets" and sampling from a "test set", but does not provide specific percentages or absolute counts for training, validation, and test splits, nor does it detail a splitting methodology beyond the use of existing dataset splits.
Hardware Specification | Yes | We train our RGB-D editing diffusion model on 6 × 16 V100 GPUs. We train 3D-VLAs on 6 × 32 V100s. We train 3D-VLAs for a maximum of 30 epochs on 6 × 64 V100s.
Software Dependencies | Yes | We utilize BLIP2-FlanT5XL (Li et al., 2023b) as our pretrained model. For RGB-D to RGB-D generation, we employ Stable Diffusion V1.4 (Rombach et al., 2022) as our pretrained model... For point-to-point generation, we use Point-E (Nichol et al., 2022) as the pretrained model... We utilize LoRA (Hu et al., 2021) to fine-tune different diffusion models. We use spaCy (Honnibal & Montani, 2017) to parse the instructions... We utilize a pre-trained grounding model (e.g., Grounded-SAM (Ren et al., 2024))... The ChatGPT version used in our paper is GPT-3.5-turbo-0125. (A minimal spaCy parsing sketch follows the table.)
Experiment Setup | Yes | We train at 256×256 resolution with a batch size of 32 on each GPU. We use a learning rate of 10^-4. The batch size is set to 4 on each node during training. Additionally, we apply a linear warmup of the learning rate during the initial 1K steps, increasing from 10^-8 to 10^-5, followed by a cosine decay with a minimum learning rate of 10^-6. The AdamW optimizer is used, with beta1 = 0.9, beta2 = 0.999, and a weight decay of 0.05. (A schedule sketch follows the table.)
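
The instruction-parsing dependency quoted in the Software Dependencies row is simple enough to illustrate. The snippet below is a minimal sketch under assumptions, not the authors' released code: it assumes the en_core_web_sm pipeline and a plain noun-chunk filter, and only shows how object phrases could be extracted from an instruction before being handed to a grounding model such as Grounded-SAM.

```python
# Minimal sketch (not the authors' code) of the spaCy instruction-parsing step
# described in the paper. Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline; assumption, any spaCy model works

def extract_object_phrases(instruction: str) -> list[str]:
    """Return candidate object phrases mentioned in a language instruction."""
    doc = nlp(instruction)
    # Keep noun chunks whose head is a noun, dropping pronouns such as "it".
    return [chunk.text for chunk in doc.noun_chunks if chunk.root.pos_ == "NOUN"]

phrases = extract_object_phrases("pick up the red mug and place it on the top shelf")
print(phrases)  # e.g. ['the red mug', 'the top shelf']
# Each phrase would then be passed to a grounding model (e.g., Grounded-SAM)
# to localize the corresponding object; that hand-off is an assumption here.
```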
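Similarly, the Experiment Setup row describes a standard warmup-then-cosine learning-rate schedule with AdamW. The sketch below is an illustration under assumptions rather than the paper's training code: the model, total step count, and the use of a single LambdaLR schedule are placeholders; only the AdamW hyperparameters, the 1K-step linear warmup from 10^-8 to 10^-5, and the 10^-6 floor come from the quote above.

```python
# Hedged sketch of the quoted optimization setup: AdamW (betas 0.9/0.999,
# weight decay 0.05), linear warmup for 1K steps from 1e-8 to 1e-5, then
# cosine decay to a floor of 1e-6. Model and total_steps are placeholders.
import math
import torch

model = torch.nn.Linear(16, 16)            # placeholder model
warmup_steps, total_steps = 1_000, 100_000  # total_steps is an assumption
lr_start, lr_peak, lr_min = 1e-8, 1e-5, 1e-6

optimizer = torch.optim.AdamW(
    model.parameters(), lr=lr_peak, betas=(0.9, 0.999), weight_decay=0.05
)

def lr_lambda(step: int) -> float:
    """Multiplier on lr_peak: linear warmup, then cosine decay to lr_min."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return (lr_start + frac * (lr_peak - lr_start)) / lr_peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return (lr_min + (lr_peak - lr_min) * cosine) / lr_peak

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # forward / backward pass would go here before the update
    optimizer.step()
    scheduler.step()
    if step in (0, warmup_steps, total_steps - 1):
        print(step, scheduler.get_last_lr()[0])  # sanity-check the schedule
```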