Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
STRIDER: Navigation via Instruction-Aligned Structural Decision Space Optimization
Authors: Diqi He, Xuehao Gao, Hao Li, Junwei Han, Dingwen Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the R2R-CE and RxR-CE benchmarks demonstrate that STRIDER significantly outperforms strong SOTA across key metrics; in particular, it improves Success Rate (SR) from 29% to 35%, a relative gain of 20.7%. Such results highlight the importance of spatially constrained decision-making and feedback-guided execution in improving navigation fidelity for zero-shot VLN-CE. |
| Researcher Affiliation | Academia | Diqi He¹, Xuehao Gao¹, Hao Li¹,², Junwei Han¹,³, Dingwen Zhang¹; ¹Northwestern Polytechnical University, ²Nanyang Technological University, ³Chongqing University of Posts and Telecommunications |
| Pseudocode | No | The paper describes the methodology in prose and figures (e.g., Figure 2 for an overview, Figure 3 for waypoint selection), but does not include a clearly labeled pseudocode or algorithm block with structured steps formatted like code. |
| Open Source Code | Yes | https://github.com/diqihe666/STRIDER-Nav |
| Open Datasets | Yes | We conduct experiments on the R2R-CE dataset, which extends the Room-to-Room (R2R) benchmark for visual language navigation (VLN) [3, 25]. This dataset consists of natural language instructions paired with navigation trajectories in realistic 3D indoor environments, derived from the Matterport3D dataset [4]. We also use the RxR-CE dataset, which extends the Room-Across-Room (RxR) benchmark with similar challenging conditions [26, 25]. |
| Dataset Splits | No | We follow the settings of Open-Nav [42], conducting tests on 100 randomly selected episodes from the dataset. In these experiments, we leverage both VLM and LLM to perform zero-shot navigation. |
| Hardware Specification | No | Our VLM and LLM are accessed via API rather than deployed locally; for local deployment using open-source models, please refer to Open-Nav [42]. The paper does not specify the hardware (e.g., GPU/CPU models) used to run the experiments, only that APIs were used for VLM/LLM. |
| Software Dependencies | No | For perception and feedback generation, we use Qwen-VL-Max as the Vision-Language Model (VLM). The action selection process is guided by GPT-4o, which reasons over the instruction, structured perception, and feedback to choose the next waypoint. The paper lists specific VLM/LLM models used but does not provide details on other ancillary software like programming languages, frameworks, or operating system versions. |
| Experiment Setup | Yes | At each step, the agent receives an RGB-D observation, where the RGB input is resized to 244 × 244 × 3 and the depth map to 256 × 256. Structured waypoints are generated by extracting skeletons from depth without relying on any pretrained waypoint predictor. For perception and feedback generation, we use Qwen-VL-Max as the Vision-Language Model (VLM). The action selection process is guided by GPT-4o, which reasons over the instruction, structured perception, and feedback to choose the next waypoint. |
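
The relative gain quoted in the Research Type row can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_gain` is hypothetical and not taken from the paper or its repository, and the inputs are the two Success Rate values quoted above.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent.

    Hypothetical helper for illustration; not part of the STRIDER code base.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Success Rate values (in percent) quoted in the Research Type row.
gain = relative_gain(29.0, 35.0)
print(f"{gain:.1f}%")  # prints "20.7%"
```

The absolute improvement is 6 percentage points; dividing by the 29% baseline gives the reported relative gain of about 20.7%.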