Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

NavBench: Probing Multimodal Large Language Models for Embodied Navigation

Authors: Yanyuan Qiao, Haodong Hong, Wenqi Lyu, Dong An, Siqi Zhang, Yutong Xie, Xinyu Wang, Qi Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present NavBench, a benchmark to evaluate the embodied navigation capabilities of MLLMs under zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks including global instruction alignment, temporal progress estimation, and local observation-action reasoning, covering 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLMs outputs into robotic actions. We evaluate both proprietary and open-source models, finding that GPT-4o performs well across tasks, while lighter open-source models succeed in simpler cases.
Researcher Affiliation | Academia | Yanyuan Qiao (1), Haodong Hong (2, 3), Wenqi Lyu (4), Dong An (5), Siqi Zhang (6), Yutong Xie (5), Xinyu Wang (4), Qi Wu (4). Affiliations: (1) Swiss Federal Institute of Technology Lausanne (EPFL); (2) The University of Queensland; (3) CSIRO Data61; (4) The University of Adelaide; (5) Mohamed bin Zayed University of Artificial Intelligence; (6) Tongji University.
Pseudocode | No | The paper only describes methodologies and calculations through prose and mathematical equations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We will release the dataset and code. Anonymized supplementary material includes reproduction instructions. (Details in Supplementary)
Open Datasets | Yes | NavBench is constructed by reorganizing and enriching fine-grained navigation data with multimodal observations to enable zero-shot evaluation of MLLMs. We start by collecting instruction-trajectory pairs from multiple embodied navigation benchmarks, including R2R [9], RxR [30], GEL-R2R [56], and FGR2R [57]. We will release the dataset and code.
Dataset Splits | Yes | NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks including global instruction alignment, temporal progress estimation, and local observation-action reasoning, covering 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. Based on the final scores, each case is categorized into one of three levels, as illustrated in Figure 4: Easy (score 1-3): Short paths with simple instructions, few steps, minimal spatial reasoning, and clear landmarks. Medium (score 4-6): Instructions with moderate length, multiple landmarks or spatial phrases, and medium-length paths. Hard (score 7-9): Long trajectories guided by complex multi-step instructions, often involving floor transitions and multiple spatial references.
Hardware Specification | Yes | open-source models are deployed using vLLM [64] and lmdeploy [65] on a single NVIDIA A6000 GPU (48GB). For real-world deployment, we integrate our pipeline with a dual-arm composite mobile robot equipped with an Intel RealSense D435 camera and a Water Drop 2 wheeled base.
Software Dependencies | No | Proprietary models are accessed via APIs, while open-source models are deployed using vLLM [64] and lmdeploy [65]... The specific version numbers for vLLM and lmdeploy are not provided.
Experiment Setup | Yes | We evaluate the navigation capabilities of MLLMs by decomposing the task into two core components: Navigation Comprehension... and Navigation Execution... For multiple-choice questions, we follow standard practice [5] and use Accuracy as the primary metric, which measures whether the model selects the correct answer from a set of candidates based on the provided information. For execution tasks, we adopt standard metrics in embodied navigation [9, 30]. Success Rate (SR) measures the percentage of episodes where the target object is visible from the agent's final viewpoint, defined as being within a 3-meter radius. Success weighted by Path Length (SPL) adjusts SR by path efficiency and is computed as: [Equation 6]. We conduct this evaluation in a zero-shot setting [54] within the Matterport3D simulator [55].
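
The difficulty levels and execution metrics quoted in the table can be illustrated with a minimal Python sketch. This is not the authors' code. The `Episode` record and its field names are hypothetical, and because Equation 6 is not reproduced on this page, the `spl` function assumes the standard definition from the embodied navigation literature: per-episode success multiplied by the ratio of shortest-path length to the larger of the taken and shortest path lengths, averaged over episodes. The 3-meter success radius and the 1-3, 4-6, 7-9 score bands are taken from the quoted text.

```python
from dataclasses import dataclass
from typing import Sequence

# Success threshold quoted in the Experiment Setup row (3-meter radius).
SUCCESS_RADIUS_M = 3.0


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative only."""
    final_distance_m: float  # distance from the agent's final viewpoint to the target
    path_length_m: float     # length of the path the agent actually took
    shortest_path_m: float   # length of the shortest path from start to target


def difficulty_level(score: int) -> str:
    """Map a complexity score to the level bands quoted in the Dataset Splits row."""
    if 1 <= score <= 3:
        return "Easy"
    if 4 <= score <= 6:
        return "Medium"
    if 7 <= score <= 9:
        return "Hard"
    raise ValueError(f"score must be between 1 and 9, got {score}")


def is_success(ep: Episode) -> bool:
    """An episode succeeds if the agent stops within the success radius."""
    return ep.final_distance_m <= SUCCESS_RADIUS_M


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of successful episodes (multiply by 100 for a percentage)."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(is_success(ep) for ep in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, ASSUMING the standard definition.

    The paper's Equation 6 is not shown on this page, so this may differ
    from the authors' exact formulation.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if not is_success(ep):
            continue  # failed episodes contribute zero
        denom = max(ep.path_length_m, ep.shortest_path_m)
        # Guard against a degenerate zero-length episode.
        total += ep.shortest_path_m / denom if denom > 0 else 1.0
    return total / len(episodes)


if __name__ == "__main__":
    demo = [
        Episode(final_distance_m=1.0, path_length_m=10.0, shortest_path_m=10.0),
        Episode(final_distance_m=2.0, path_length_m=20.0, shortest_path_m=10.0),
        Episode(final_distance_m=5.0, path_length_m=8.0, shortest_path_m=10.0),
    ]
    print(difficulty_level(5), success_rate(demo), spl(demo))
```

In the demo, two of three episodes end within 3 meters, and the second takes a path twice as long as the shortest one, so it contributes only half credit to SPL. This shows how SPL penalizes inefficient paths that SR alone would count as full successes.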