Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

Authors: Junting Chen, Haotian Liang, Lingxiao Du, Weiyun Wang, Mengkang Hu, Yao Mu, Wenhai Wang, Jifeng Dai, Ping Luo, Wenqi Shao, Lin Shao

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models including GPT-4o and strong zero-shot generalization in real world. The project page is at https://hhyhrhy.github.io/owmm-agent-project. 5 Experiments: In this section, we present the evaluation results in both simulation and real-world data. We present the experimental results of single-step evaluation for OWMM-VLM in our simulated benchmark in section 5.1 and episodic evaluation for the OWMM-Agent in our simulated benchmark in section 5.2. We then present the real-world evaluation in section 5.3.
Researcher Affiliation | Collaboration | 1 Shanghai AI Laboratory; 2 School of Computing, National University of Singapore; 3 USTC; 4 The University of Hong Kong; 5 Shanghai Jiao Tong University; 6 Tsinghua University
Pseudocode | No | The paper describes methods and pipelines in text and diagrams (e.g., Figure 2, Section 3 Methodology, Section 4.1 Agentic Data Synthesis Pipeline) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The project page is at https://hhyhrhy.github.io/owmm-agent-project. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The data and code are open-sourced and anonymous in the given link in abstract.
Open Datasets | Yes | We used 143 scenes from The Habitat Synthetic Scenes Dataset (HSSD) [13] and combined objects from YCB Objects [3] and Google Scanned Objects [6] to create a dataset with 157 unique manipulation objects and 1,471 receptacles from selected scenes.
Dataset Splits | Yes | We partitioned the scenes into training and testing sets using a ratio of 113:30. Besides, we allocated 157 objects between the training and validation sets with a ratio of 137:20, ensuring that the testing set contained entirely unseen objects. This division resulted in a total of 152k training data entries and 4k testing data entries, establishing a robust dataset for training and testing in our OWMM task.
Hardware Specification | Yes | OWMM-VLM-8B is trained on 8X NVIDIA A100 GPUs for about 7 hours, and OWMM-VLM-38B is trained on 24X NVIDIA A100 GPUs for about 18 hours. OWMM-VLM-8B: Single-step inference on a single A100-40G GPU with 2+1, 4+1, 8+1, and 16+1 frames (posed frames + egocentric frame). OWMM-VLM-38B: Single-step inference on 4 A100-40G GPUs with 8+1, 16+1, 32+1, and 64+1 frames using parallel inference.
Software Dependencies | No | The paper mentions several software components, models, and frameworks like "InternVL-2.5 [5]", "HomeRobot [40] framework", "Robi Butler [35]", and "Gmapping [9] algorithm". However, it does not provide specific version numbers for general-purpose programming languages, libraries, or development environments such as Python, PyTorch, or CUDA, which are typically required for reproducible software descriptions.
Experiment Setup | Yes | The OWMM-VLM model is trained to autoregressively generate the response tokens consisting of the output action and its corresponding task context in JSON format. Specifically, we freeze the parameters in ViT and only adjust the parameters in MLP and LLM. As for the training time, OWMM-VLM-8B is trained on 8X NVIDIA A100 GPUs for about 7 hours, and OWMM-VLM-38B is trained on 24X NVIDIA A100 GPUs for about 18 hours. Both our models were trained for 1 epoch. For PIVOT, we configured the following parameters: n_samples_init=10, n_samples_opt=6, n_iters=2. In our evaluation settings, as the input consists of a single RGB image and task instructions, we randomly sample initial points in the image from a 2D Gaussian distribution. The distribution is parameterized with a mean of (256, 256) and standard deviation of (100, 100).
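
The initial-point sampling quoted in the Experiment Setup entry can be illustrated with a minimal sketch. It is not the authors' code. The mean (256, 256), the standard deviation (100, 100), and n_samples_init=10 come from the quoted text. The function name `sample_initial_points`, the 512x512 image size, the independent sampling of the two axes, the clamping to the image bounds, and the use of n_samples_init as the number of points drawn are all assumptions made here for illustration.

```python
import random

# Values quoted in the Experiment Setup entry.
N_SAMPLES_INIT = 10
MEAN = (256.0, 256.0)
STD = (100.0, 100.0)

# Assumption (not stated in the source): a 512x512 image,
# which would place the quoted mean at the image center.
IMAGE_SIZE = (512, 512)


def sample_initial_points(n=N_SAMPLES_INIT, mean=MEAN, std=STD,
                          image_size=IMAGE_SIZE, seed=None):
    """Draw n pixel coordinates from an axis-aligned 2D Gaussian.

    Assumptions: x and y are sampled independently (diagonal covariance),
    and out-of-image samples are clamped to the nearest valid pixel.
    """
    rng = random.Random(seed)
    width, height = image_size
    points = []
    for _ in range(n):
        x = rng.gauss(mean[0], std[0])
        y = rng.gauss(mean[1], std[1])
        # Round to integer pixels and clamp to the image bounds.
        x = min(max(int(round(x)), 0), width - 1)
        y = min(max(int(round(y)), 0), height - 1)
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in sample_initial_points(seed=0):
        print(point)
```

Passing a fixed `seed` makes the drawn points repeatable, which matters for a reproducibility check because the quoted setup does not report a random seed.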