An Embodied Generalist Agent in 3D World

Authors: Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation. Our ablative studies and scaling analyses further provide valuable insights for developing future embodied generalist agents.
Researcher Affiliation | Academia | Jiangyong Huang*1,2, Silong Yong*1,3, Xiaojian Ma*1, Xiongkun Linghu*1, Puhao Li1,3, Yan Wang1, Qing Li1, Song-Chun Zhu1,2,3, Baoxiong Jia1, Siyuan Huang1. *Equal contribution. Research lead. 1: State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI). 2: Peking University. 3: Tsinghua University.
Pseudocode | No | The paper does not contain a pseudocode block or algorithm labeled as such.
Open Source Code | Yes | Code and data are available on project page.
Open Datasets | Yes | Since LEO is a generalist agent that receives multi-modal inputs and follows instructions, we adopt the two-stage training proposed by Liu et al. (2023b) and split the data into two sets: (i) LEO-align (Sec. 3.1) that focuses on 3D vision-language (VL) alignment to bridge the gap between 3D scene representation and natural language; and (ii) LEO-instruct (Sec. 3.2) that targets at 3D VLA instruction tuning to endow LEO with various capabilities. The statistics and examples of these datasets can be found in Tab. 1 and Appendix C, respectively.
Dataset Splits | Yes | The evaluation is conducted on the original validation split of the MP3D ObjNav task and the newly introduced HM3D ObjNav task (Ramakrishnan et al., 2021).
Hardware Specification | Yes | Type of GPUs: NVIDIA A100
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies beyond the general mention of models and frameworks like OpenCLIP ConvNeXt, Vicuna-7B, PyTorch, etc.
Experiment Setup | Yes | Table A.13: Hyperparameters for the instruction-tuning stage. Optimizer: AdamW; Weight decay: 0.05; Betas: [0.9, 0.999]; Learning rate: 3e-5; Warmup steps: 400; Number of workers: 4; Parallel strategy: DDP; Type of GPUs: NVIDIA A100; Number of GPUs: 4; Accumulate gradient batches: 5; Batch size per GPU (total): 4 (80); Training precision: bfloat16; Gradient norm: 5.0; Epochs: 10.
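
The total batch size of 80 reported in Table A.13 is consistent with multiplying the per-GPU batch size by the number of GPUs and the number of gradient-accumulation steps (4 x 4 x 5). The following is a minimal sketch that records the listed hyperparameters and checks this arithmetic. The class name `InstructionTuningConfig` and its field names are hypothetical, chosen for illustration; they are not taken from the LEO codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionTuningConfig:
    """Hyperparameters as listed in Table A.13 (field names are illustrative)."""
    optimizer: str = "AdamW"
    weight_decay: float = 0.05
    betas: tuple = (0.9, 0.999)
    learning_rate: float = 3e-5
    warmup_steps: int = 400
    num_workers: int = 4
    parallel_strategy: str = "DDP"
    gpu_type: str = "NVIDIA A100"
    num_gpus: int = 4
    accumulate_grad_batches: int = 5
    batch_size_per_gpu: int = 4
    precision: str = "bfloat16"
    gradient_norm: float = 5.0
    epochs: int = 10

    def total_batch_size(self) -> int:
        # Samples contributing to one optimizer step:
        # per-GPU batch x number of GPUs x accumulated micro-batches.
        return (
            self.batch_size_per_gpu
            * self.num_gpus
            * self.accumulate_grad_batches
        )


config = InstructionTuningConfig()
print(config.total_batch_size())  # 80, matching "4 (80)" in the table
```

Keeping the three factors as separate fields makes it easy to see how the total would change on different hardware, for example when fewer GPUs are available and more accumulation steps are needed to reach the same total.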