An Embodied Generalist Agent in 3D World
Authors: Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation. Our ablative studies and scaling analyses further provide valuable insights for developing future embodied generalist agents. |
| Researcher Affiliation | Academia | Jiangyong Huang*1,2, Silong Yong*1,3, Xiaojian Ma*1, Xiongkun Linghu*1, Puhao Li1,3, Yan Wang1, Qing Li1, Song-Chun Zhu1,2,3, Baoxiong Jia1, Siyuan Huang1. *Equal contribution. Research lead. 1State Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI); 2Peking University; 3Tsinghua University. |
| Pseudocode | No | The paper does not contain a pseudocode block or algorithm labeled as such. |
| Open Source Code | Yes | Code and data are available on the project page. |
| Open Datasets | Yes | Since LEO is a generalist agent that receives multi-modal inputs and follows instructions, we adopt the two-stage training proposed by Liu et al. (2023b) and split the data into two sets: (i) LEO-align (Sec. 3.1) that focuses on 3D vision-language (VL) alignment to bridge the gap between 3D scene representation and natural language; and (ii) LEO-instruct (Sec. 3.2) that targets 3D VLA instruction tuning to endow LEO with various capabilities. The statistics and examples of these datasets can be found in Tab. 1 and Appendix C, respectively. |
| Dataset Splits | Yes | The evaluation is conducted on the original validation split of the MP3D ObjNav task and the newly introduced HM3D ObjNav task (Ramakrishnan et al., 2021). |
| Hardware Specification | Yes | Type of GPUs: NVIDIA A100 |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies beyond the general mention of models and frameworks like OpenCLIP ConvNeXt, Vicuna-7B, PyTorch, etc. |
| Experiment Setup | Yes | Table A.13 (hyperparameters for the instruction-tuning stage): Optimizer: AdamW; Weight decay: 0.05; Betas: [0.9, 0.999]; Learning rate: 3e-5; Warmup steps: 400; Number of workers: 4; Parallel strategy: DDP; Type of GPUs: NVIDIA A100; Number of GPUs: 4; Accumulate gradient batches: 5; Batch size per GPU (total): 4 (80); Training precision: bfloat16; Gradient norm: 5.0; Epochs: 10. A minimal training-loop sketch using these values appears below the table. |
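
The Experiment Setup row lists the instruction-tuning hyperparameters reported in Table A.13. As a reading aid, the following minimal PyTorch sketch wires those values into an optimizer, a warmup schedule, and a gradient-accumulation loop. The model, data, and loop length are placeholders, and the DDP/multi-GPU setup is omitted; only the numeric values come from the paper, so this is an assumption-laden illustration rather than the authors' training code.

```python
# Sketch only: hyperparameter values from Table A.13; model, data, and loop are placeholders.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)          # placeholder for LEO's trainable parameters

optimizer = AdamW(
    model.parameters(),
    lr=3e-5,                             # Learning rate
    betas=(0.9, 0.999),                  # Betas
    weight_decay=0.05,                   # Weight decay
)

WARMUP_STEPS = 400                       # Warmup steps (linear warmup assumed)
scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS))

ACCUM_STEPS = 5                          # Accumulate gradient batches
GRAD_NORM = 5.0                          # Gradient norm clipping
EPOCHS = 10                              # Epochs

for epoch in range(EPOCHS):
    for step in range(100):              # placeholder dataloader length
        x = torch.randn(4, 16)           # batch size per GPU = 4
        with torch.autocast("cpu", dtype=torch.bfloat16):   # bfloat16 training precision
            loss = model(x).pow(2).mean()                    # placeholder loss
        (loss / ACCUM_STEPS).backward()
        if (step + 1) % ACCUM_STEPS == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_NORM)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```

As an arithmetic check on the reported values, a per-GPU batch of 4 across 4 GPUs with 5 gradient-accumulation steps gives an effective batch size of 4 × 4 × 5 = 80, consistent with the "(total) 80" entry in the row above.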