Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. |
| Researcher Affiliation | Academia | Zaijing Li (1,2), Yuquan Xie (1), Rui Shao (1), Gongwei Chen (1), Dongmei Jiang (2), Liqiang Nie (1); (1) Harbin Institute of Technology, Shenzhen; (2) Peng Cheng Laboratory |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please see the project page at https://cybertronagent.github.io/Optimus1.github.io/. |
| Open Datasets | Yes | Environment. To ensure realistic gameplay like human players, we employ MineRL [11] with Minecraft 1.16.5 as our simulation environment. Benchmark. We constructed a benchmark of 67 tasks to evaluate Optimus-1's ability to complete long-horizon tasks. As illustrated in Table 5, we divide the 67 Minecraft tasks into 7 groups according to recommended categories in Minecraft. Please refer to Appendix D for more details. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for validation. |
| Hardware Specification | Yes | All experiments were implemented on 4x NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models such as GPT-4V, STEVE-1 [25], DeepSeek-VL [26], InternLM-XComposer2-VL [6], and MineCLIP [7], but does not provide version numbers for the underlying software dependencies or libraries. |
| Experiment Setup | Yes | We conduct extensive ablation experiments on 18 tasks; the experiment settings can be found in Table 6. The agent always starts in survival mode, with an empty inventory. We conducted each task at least 30 times using different world seeds and reported the average success rate to ensure fair and thorough evaluation. Additionally, we add the average steps and average time of completing the task as evaluation metrics. |
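To make the Environment row concrete for reproduction, a seeded rollout in a MineRL-style Gym environment might look like the sketch below. This is not the paper's actual harness: the environment id, step budget, and seeding behavior are assumptions, and the paper runs Optimus-1's own agent rather than a no-op policy.

```python
# Minimal MineRL-style rollout sketch. Assumptions (not from the paper):
# the environment id, the step budget, and that env.seed() controls the world seed.
import gym
import minerl  # importing minerl registers its environments with gym

ENV_ID = "MineRLObtainDiamond-v0"  # illustrative; Optimus-1 defines its own 67-task benchmark on Minecraft 1.16.5


def run_episode(world_seed: int, max_steps: int = 12000) -> dict:
    """Run one seeded episode and report how it ended and how long it took."""
    env = gym.make(ENV_ID)
    env.seed(world_seed)        # assumed gym-style hook for the Minecraft world seed
    obs = env.reset()
    done, steps = False, 0
    while not done and steps < max_steps:
        action = env.action_space.noop()  # placeholder; the agent's controller would choose actions here
        obs, reward, done, info = env.step(action)
        steps += 1
    env.close()
    return {"seed": world_seed, "steps": steps, "done": done}
```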
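The Experiment Setup row describes the evaluation protocol: at least 30 runs per task with different world seeds, reporting average success rate, average steps, and average time. A small aggregation helper in that spirit is sketched below; the record fields and function names are illustrative assumptions rather than the authors' code.

```python
# Aggregate per-run records into the reported metrics per task:
# success rate, average steps, and average wall-clock time.
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    task: str
    seed: int
    success: bool
    steps: int
    seconds: float


def summarize(runs: list[RunRecord]) -> dict[str, dict[str, float]]:
    """Group runs by task and compute success rate, average steps, and average time."""
    by_task: dict[str, list[RunRecord]] = {}
    for r in runs:
        by_task.setdefault(r.task, []).append(r)
    summary = {}
    for task, rs in by_task.items():
        assert len(rs) >= 30, f"{task}: fewer than 30 seeded runs"  # protocol stated in the paper
        summary[task] = {
            "success_rate": mean(1.0 if r.success else 0.0 for r in rs),
            "avg_steps": mean(r.steps for r in rs),
            "avg_time_s": mean(r.seconds for r in rs),
        }
    return summary
```

Feeding the per-episode dictionaries from the rollout sketch (plus a task label, success flag, and timing) into `summarize` reproduces the shape of the metrics the paper reports, averaged over the different world seeds.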