reproducibilityindex.ai

An Embodied Generalist Agent in 3D World

Authors: Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation. Our ablative studies and scaling analyses further provide valuable insights for developing future embodied generalist agents.
Researcher Affiliation	Academia	Jiangyong Huang *1,2, Silong Yong *1,3, Xiaojian Ma *1, Xiongkun Linghu *1, Puhao Li 1,3, Yan Wang 1, Qing Li 1, Song-Chun Zhu 1,2,3, Baoxiong Jia 1, Siyuan Huang 1. *Equal contribution; Research lead. 1 State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI); 2 Peking University; 3 Tsinghua University.
Pseudocode	No	The paper does not contain a pseudocode block or algorithm labeled as such.
Open Source Code	Yes	Code and data are available on project page.
Open Datasets	Yes	Since LEO is a generalist agent that receives multi-modal inputs and follows instructions, we adopt the two-stage training proposed by Liu et al. (2023b) and split the data into two sets: (i) LEO-align (Sec. 3.1) that focuses on 3D vision-language (VL) alignment to bridge the gap between 3D scene representation and natural language; and (ii) LEO-instruct (Sec. 3.2) that targets at 3D VLA instruction tuning to endow LEO with various capabilities. The statistics and examples of these datasets can be found in Tab. 1 and Appendix C, respectively.
Dataset Splits	Yes	The evaluation is conducted on the original validation split of the MP3D ObjNav task and the newly introduced HM3D ObjNav task (Ramakrishnan et al., 2021).
Hardware Specification	Yes	Type of GPUs: NVIDIA A100
Software Dependencies	No	The paper does not provide specific version numbers for software dependencies beyond the general mention of models and frameworks like OpenCLIP ConvNeXt, Vicuna-7B, PyTorch, etc.
Experiment Setup	Yes	Table A.13: Hyperparameters for the instruction-tuning stage. Optimizer: AdamW; Weight decay: 0.05; Betas: [0.9, 0.999]; Learning rate: 3e-5; Warmup steps: 400; Number of workers: 4; Parallel strategy: DDP; Type of GPUs: NVIDIA A100; Number of GPUs: 4; Accumulate gradient batches: 5; Batch size per GPU (total): 4 (80); Training precision: bfloat16; Gradient norm: 5.0; Epochs: 10.
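
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which also makes it easy to check how the total batch size of 80 follows from the other values. The sketch below is illustrative only: the class name, field names, and the `effective_batch_size` helper are hypothetical and do not come from the paper or its code release. It assumes the total is the per-GPU batch size multiplied by the number of GPUs and the gradient accumulation steps, which is the usual convention under DDP and is consistent with the quoted figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionTuningConfig:
    """Values copied from the quoted Table A.13; field names are hypothetical."""

    optimizer: str = "AdamW"
    weight_decay: float = 0.05
    betas: tuple = (0.9, 0.999)
    learning_rate: float = 3e-5
    warmup_steps: int = 400
    num_workers: int = 4
    parallel_strategy: str = "DDP"
    gpu_type: str = "NVIDIA A100"
    num_gpus: int = 4
    accumulate_grad_batches: int = 5
    batch_size_per_gpu: int = 4
    precision: str = "bfloat16"
    gradient_norm: float = 5.0
    epochs: int = 10

    def effective_batch_size(self) -> int:
        # Assumed convention: samples per optimizer step across all GPUs
        # = per-GPU batch size * number of GPUs * accumulation steps.
        return (
            self.batch_size_per_gpu
            * self.num_gpus
            * self.accumulate_grad_batches
        )


config = InstructionTuningConfig()
total = config.effective_batch_size()
print(f"Effective batch size: {total}")
```

Under this assumption, 4 samples per GPU on 4 GPUs with 5 accumulation steps gives 4 x 4 x 5 = 80, matching the "(total)" value reported in the table.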