Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. |
| Researcher Affiliation | Academia | Zaijing Li (1,2), Yuquan Xie (1), Rui Shao (1), Gongwei Chen (1), Dongmei Jiang (2), Liqiang Nie (1); (1) Harbin Institute of Technology, Shenzhen; (2) Peng Cheng Laboratory |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please see the project page at https://cybertronagent.github.io/Optimus1.github.io/. |
| Open Datasets | Yes | Environment. To ensure realistic gameplay like human players, we employ MineRL [11] with Minecraft 1.16.5 as our simulation environment. Benchmark. We constructed a benchmark of 67 tasks to evaluate Optimus-1's ability to complete long-horizon tasks. As illustrated in Table 5, we divide the 67 Minecraft tasks into 7 groups according to recommended categories in Minecraft. Please refer to Appendix D for more details. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for validation. |
| Hardware Specification | Yes | All experiments were implemented on 4x NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions specific models such as GPT-4V, STEVE-1 [25], DeepSeek-VL [26], InternLM-XComposer2-VL [6], and MineCLIP [7], but does not provide version numbers for the underlying software dependencies or libraries. |
| Experiment Setup | Yes | We conduct extensive ablation experiments on 18 tasks; the experiment settings can be found in Table 6. The agent always starts in survival mode, with an empty inventory. We conducted each task at least 30 times using different world seeds and reported the average success rate to ensure fair and thorough evaluation. Additionally, we add the average steps and average time of completing the task as evaluation metrics. |
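To make the Environment row concrete for reproduction, a seeded rollout in a MineRL-style Gym environment might look like the sketch below. This is not the paper's actual harness: the environment id, step budget, and seeding behavior are assumptions, and the paper runs Optimus-1's own agent rather than a no-op policy.

```python
# Minimal MineRL-style rollout sketch. Assumptions (not from the paper):
# the environment id, the step budget, and that env.seed() controls the world seed.
import gym
import minerl  # importing minerl registers its environments with gym

ENV_ID = "MineRLObtainDiamond-v0"  # illustrative; Optimus-1 defines its own 67-task benchmark on Minecraft 1.16.5


def run_episode(world_seed: int, max_steps: int = 12000) -> dict:
    """Run one seeded episode and report how it ended and how long it took."""
    env = gym.make(ENV_ID)
    env.seed(world_seed)        # assumed gym-style hook for the Minecraft world seed
    obs = env.reset()
    done, steps = False, 0
    while not done and steps < max_steps:
        action = env.action_space.noop()  # placeholder; the agent's controller would choose actions here
        obs, reward, done, info = env.step(action)
        steps += 1
    env.close()
    return {"seed": world_seed, "steps": steps, "done": done}
```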
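The Experiment Setup row describes the evaluation protocol: at least 30 runs per task with different world seeds, reporting average success rate, average steps, and average time. A small aggregation helper in that spirit is sketched below; the record fields and function names are illustrative assumptions rather than the authors' code.

```python
# Aggregate per-run records into the reported metrics per task:
# success rate, average steps, and average wall-clock time.
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    task: str
    seed: int
    success: bool
    steps: int
    seconds: float


def summarize(runs: list[RunRecord]) -> dict[str, dict[str, float]]:
    """Group runs by task and compute success rate, average steps, and average time."""
    by_task: dict[str, list[RunRecord]] = {}
    for r in runs:
        by_task.setdefault(r.task, []).append(r)
    summary = {}
    for task, rs in by_task.items():
        assert len(rs) >= 30, f"{task}: fewer than 30 seeded runs"  # protocol stated in the paper
        summary[task] = {
            "success_rate": mean(1.0 if r.success else 0.0 for r in rs),
            "avg_steps": mean(r.steps for r in rs),
            "avg_time_s": mean(r.seconds for r in rs),
        }
    return summary
```

Feeding the per-episode dictionaries from the rollout sketch (plus a task label, success flag, and timing) into `summarize` reproduces the shape of the metrics the paper reports, averaged over the different world seeds.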