Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks
Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. |
| Researcher Affiliation | Academia | Zaijing Li¹,², Yuquan Xie¹, Rui Shao¹, Gongwei Chen¹, Dongmei Jiang², Liqiang Nie¹; ¹Harbin Institute of Technology, Shenzhen; ²Peng Cheng Laboratory |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please see the project page at https://cybertronagent.github.io/Optimus1.github.io/. |
| Open Datasets | Yes | Environment. To ensure realistic gameplay like human players, we employ MineRL [11] with Minecraft 1.16.5 as our simulation environment. Benchmark. We constructed a benchmark of 67 tasks to evaluate the Optimus-1's ability to complete long-horizon tasks. As illustrated in Table 5, we divide the 67 Minecraft tasks into 7 groups according to recommended categories in Minecraft. Please refer to Appendix D for more details. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for validation. |
| Hardware Specification | Yes | All experiments were implemented on 4x NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models like GPT-4V, STEVE-1 [25], DeepSeek-VL [26], InternLM-XComposer2-VL [6], and MineCLIP [7], but does not provide specific version numbers for underlying software dependencies or libraries. |
| Experiment Setup | Yes | We conduct extensive ablation experiments on 18 tasks, experiment setting can be found in Table 6. The agent always starts in survival mode, with an empty inventory. We conducted at least 30 times for each task using different world seeds and reported the average success rate to ensure fair and thorough evaluation. Additionally, we add the average steps and average time of completing the task as evaluation metrics. |
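
The evaluation protocol in the Experiment Setup row (at least 30 runs per task with different world seeds, reporting average success rate plus average steps and time) could be aggregated roughly as follows. This is a minimal sketch, not the authors' evaluation code: the `Run` record, the `summarize` function, and the choice to average steps and time over successful runs only are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Run:
    """One evaluation episode (hypothetical record layout)."""
    seed: int        # world seed used for this episode
    success: bool    # whether the task was completed
    steps: int       # environment steps taken
    seconds: float   # wall-clock time taken


def summarize(runs: list[Run], min_runs: int = 30) -> dict[str, Optional[float]]:
    """Aggregate per-task metrics over repeated runs.

    Assumption: average steps and time are computed over successful
    runs only, since the source describes them as the cost "of
    completing the task".
    """
    if len(runs) < min_runs:
        raise ValueError(f"need at least {min_runs} runs, got {len(runs)}")
    if len({r.seed for r in runs}) != len(runs):
        raise ValueError("each run must use a different world seed")

    completed = [r for r in runs if r.success]
    return {
        "success_rate": len(completed) / len(runs),
        "avg_steps": mean([r.steps for r in completed]) if completed else None,
        "avg_seconds": mean([r.seconds for r in completed]) if completed else None,
    }
```

Calling `summarize` once per task and then averaging the resulting success rates within each of the 7 task groups would yield group-level numbers; whether the paper aggregates exactly this way is not stated in the rows above.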