Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation

Authors: Zihan Wang, Seungjun Lee, Gim Hee Lee

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our monocular VLN system achieves state-of-the-art performance on benchmarks including R2R-CE, REVERIE-CE, and NavRAG-CE. The results also demonstrate our strong capabilities in pre-exploration, lifelong memory and real-world experiments. Section 4, titled 'Experiments', further details the empirical studies conducted, including comparisons with SOTA methods, pre-exploration and lifelong memory experiments, real-world experiments, computational cost analysis, and an ablation study.
Researcher Affiliation | Academia | Zihan Wang, Seungjun Lee, Gim Hee Lee; School of Computing, National University of Singapore; EMAIL, EMAIL
Pseudocode | No | The paper describes its methodology in Section 3, titled 'Our Method', using narrative text, architectural diagrams (Figure 2), and mathematical equations. However, it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/MrZihan/Dynam3D.
Open Datasets | Yes | We train the Merging Discriminator using over 5K rooms with 3D instance segmentation data: ScanNet [47], HM3D [48], Matterport3D [22] and 3RScan [49], where the annotation of instances of point clouds are processed for each point with world coordinate and instance ID. To align 3D instances with language semantics, we leverage contrastive learning on large-scale 3D-language pairs from SceneVerse [50] and g3D-LF [15]. To train our 3D-VLM with sufficient navigation data, we transfer datasets generated by ScaleVLN [60] and NavRAG [4] from discrete environments to the continuous Habitat simulator [23].
Dataset Splits | Yes | As shown in Tables 1 and 2, we evaluate the navigation performance of our Dynam3D across three distinct continuous-environment VLN benchmarks. Specifically, the R2R-CE dataset (Tables 1) provides step-by-step and following instructions. [...] Table 1: Evaluation of VLN on R2R-CE with monocular setting. [...] R2R-CE Val R2R-CE Test [...] Table 2: Evaluation of VLN on REVERIE-CE and NavRAG-CE with monocular setting. [...] REVERIE-CE Val NavRAG-CE Val
Hardware Specification | Yes | We evaluate computational cost on the R2R-CE dataset using a single NVIDIA RTX 4090 GPU. [...] We pre-train our Dynam3D representation model on the aforementioned dataset for 100K episodes (approximately 8 days) using four RTX 6000 Ada GPUs. [...] We pre-train the 3D-VLM model on the navigation datasets for 100K episodes [...] using two RTX 6000 Ada GPUs. [...] The model is deployed on a workstation equipped with an NVIDIA RTX 4090 GPU and 64GB of RAM.
Software Dependencies | No | The paper mentions that the 3.8 billion-parameter LLaVA-Phi-3-mini [52, 53] is integrated and optimized, and that the Adafactor optimizer [62] is employed with Gradient Checkpointing [63]. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | The training is performed with a batch size of 4 and a learning rate of 1e-4. [...] We pre-train the 3D-VLM model on the navigation datasets for 100K episodes (50K for stage one, 50K for stage two, approximately 9 days) [...] The training is performed with a batch size of 4 and a learning rate of 1e-6. During training, all parameters of the 3.8B LLaVA-Phi-3-mini [52, 53] are optimized, except the generalizable feature field model [15] and the pre-trained Dynam3D representation model. To mitigate memory consumption and enable efficient training of large models, we employ the Adafactor optimizer [62] in conjunction with Gradient Checkpointing [63].
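
The training figures quoted in the Hardware Specification and Experiment Setup rows can be gathered into one place to estimate the compute a reproduction would need. The following is a minimal sketch, not the authors' configuration: the `StageConfig` class, its field names, and the `gpu_days` helper are hypothetical, and pairing the 1e-4 learning rate with the representation model is an assumption, since the quoted excerpt elides which stage that sentence describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for the figures quoted in the table above."""
    name: str
    episodes: int
    batch_size: int
    learning_rate: float
    num_gpus: int       # RTX 6000 Ada GPUs, per the Hardware Specification row
    approx_days: int    # wall-clock estimate quoted in the paper


# Assumption: the first "batch size 4, learning rate 1e-4" sentence refers to
# pre-training of the Dynam3D representation model.
REPRESENTATION = StageConfig("dynam3d_representation", 100_000, 4, 1e-4, 4, 8)

# 3D-VLM pre-training: 50K episodes for stage one plus 50K for stage two.
VLM = StageConfig("3d_vlm", 50_000 + 50_000, 4, 1e-6, 2, 9)


def gpu_days(cfg: StageConfig) -> int:
    """Rough compute estimate: number of GPUs times approximate days."""
    return cfg.num_gpus * cfg.approx_days


if __name__ == "__main__":
    for cfg in (REPRESENTATION, VLM):
        print(f"{cfg.name}: about {gpu_days(cfg)} GPU-days")
```

Under these figures, representation pre-training comes to roughly 32 GPU-days and 3D-VLM pre-training to roughly 18 GPU-days. Both are approximations, because the paper reports the durations only as approximate.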