Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Aux-Think: Exploring Reasoning Strategies for Data-Efficient Vision-Language Navigation

Authors: Shuo Wang, Yongcai Wang, Wanting Li, Xudong Cai, Yucheng Wang, Maiyue Chen, kaihui.wang, Zhizhong Su, Deying Li, Zhaoxin Fan

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that Aux-Think substantially reduces training effort and achieves state-of-the-art performance on success rate. In Fig. 6, we evaluate the Success Rate (SR) per test step for Aux-Think (ours), Pre-Think, and Post-Think, with results grouped by the number of steps required for task completion. Our model is trained with 8 NVIDIA H20 GPUs for one epoch (around 60 hours), with a learning rate of 1e-5. We evaluate our method on the VLN-CE benchmarks R2R-CE [34] and RxR-CE [40] following the standard VLN-CE settings. All the methods are evaluated on the R2R val-unseen split and RxR val-unseen split.
Researcher Affiliation Collaboration (1) Renmin University of China, (2) Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, (3) Horizon Robotics
Pseudocode No The paper describes methods like No-Think, Pre-Think, Post-Think, and Aux-Think, and provides mathematical formulations for their training losses (Equation 1, 2, 3, 4, 5). It also details the action design (e.g., 'move forward, turn left, turn right, and stop' with specific step sizes and rotation angles). However, it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, step-by-step procedures in a code-like format.
Open Source Code No The paper mentions a project page: 'https://horizonrobotics.github.io/robot_lab/aux-think'. It also states: 'We also release R2R-CoT-320k, the first CoT dataset for VLN, to facilitate future research on reasoning models.' This explicitly refers to the release of the dataset, but there is no direct statement or link indicating the release of the source code for the methodology described in the paper.
Open Datasets Yes To validate the effectiveness of Aux-Think, we introduce R2R-CoT-320k, the first CoT dataset for VLN, which is large-scale and specifically tailored for the R2R-CE benchmark [34]. We also release R2R-CoT-320k, the first CoT dataset for VLN, to facilitate future research on reasoning models.
Dataset Splits Yes We evaluate our method on the VLN-CE benchmarks R2R-CE [34] and RxR-CE [40] following the standard VLN-CE settings. All the methods are evaluated on the R2R val-unseen split and RxR val-unseen split.
Hardware Specification Yes Our model is trained with 8 NVIDIA H20 GPUs for one epoch (around 60 hours), with a learning rate of 1e-5.
Software Dependencies No The paper mentions using specific models like "NVILA-lite 8B [17]", "SigLIP [58]", "Qwen2 [16]", and "Qwen-2.5-VL-72B [16]". While these are software components, the paper does not specify general programming languages, libraries, or solvers with their explicit version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x, etc.) that would be necessary for a reproducible description of ancillary software.
Experiment Setup Yes Model training. We use NVILA-lite 8B [17] as the base pretrained model, which consists of a vision encoder (SigLIP [58]), a projector, and an LLM (Qwen2 [16]). We use supervised finetuning (SFT) to train our VLN model from stage 2 of NVILA-lite, as it has finished visual language corpus pre-training. Our model is trained with 8 NVIDIA H20 GPUs for one epoch (around 60 hours), with a learning rate of 1e-5. Action design. The action space is designed into four categories: move forward, turn left, turn right, and stop. The forward action includes step sizes of 25 cm, 50 cm, and 75 cm, while the turn actions are parameterized by rotation angles of 15°, 30°, and 45°.
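
The action design quoted in the Experiment Setup row can be sketched as a small data structure. This is a minimal illustration only, not the authors' implementation: the names `ActionKind`, `Action`, and `enumerate_actions` are hypothetical, and the paper excerpt does not say how actions are encoded for the model. The sketch assumes each forward step size and each turn angle counts as a separate parameterized action.

```python
from dataclasses import dataclass
from enum import Enum


class ActionKind(Enum):
    """The four action categories named in the paper excerpt."""
    FORWARD = "move forward"
    TURN_LEFT = "turn left"
    TURN_RIGHT = "turn right"
    STOP = "stop"


# Parameter values quoted in the excerpt.
FORWARD_STEPS_CM = (25, 50, 75)
TURN_ANGLES_DEG = (15, 30, 45)


@dataclass(frozen=True)
class Action:
    """Hypothetical container: magnitude is cm for forward, degrees for turns, 0 for stop."""
    kind: ActionKind
    magnitude: int = 0


def enumerate_actions() -> list[Action]:
    """List every parameterized action under the assumption stated above."""
    actions = [Action(ActionKind.FORWARD, step) for step in FORWARD_STEPS_CM]
    for kind in (ActionKind.TURN_LEFT, ActionKind.TURN_RIGHT):
        actions.extend(Action(kind, angle) for angle in TURN_ANGLES_DEG)
    actions.append(Action(ActionKind.STOP))
    return actions


if __name__ == "__main__":
    for action in enumerate_actions():
        print(action.kind.value, action.magnitude)
```

Under this reading, the enumeration yields three forward actions, three actions for each turn direction, and one stop action; whether the original model treats them as distinct outputs is not stated in the excerpt.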