Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Authors: Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on trajectory planning, future frames generation, and scene understanding tasks demonstrate the effectiveness of pre-training paradigm and spatio-temporal CoT in FSDrive. FSDrive achieves road scene comprehension by establishing pixel-level embodied associations with the environment, rather than relying on human-designed abstract linguistic symbols, advancing autonomous driving towards visual reasoning. |
| Researcher Affiliation | Collaboration | Shuang Zeng1,2*, Xinyuan Chang2, Mengwei Xie2, Xinran Liu2, Yifan Bai1,3, Zheng Pan2, Mu Xu2, Xing Wei1. 1Xi'an Jiaotong University, 2Amap, Alibaba Group, 3DAMO Academy, Alibaba Group. EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes its proposed method FSDrive using narrative text, mathematical equations (Equations 1-6), and an architectural diagram (Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/MIV-XJTU/FSDrive. |
| Open Datasets | Yes | Extensive experiments on trajectory planning, future frames generation, and scene understanding tasks demonstrate the effectiveness of pre-training paradigm and spatio-temporal CoT in FSDrive. ...On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. ...Following the previous methods [29, 13, 4], we evaluate trajectory planning and future frames generation on the nuScenes [1]. ...Additionally, we conducted experiments on NAVSIM [10], a realistic scenario dataset designed for real-world planning. ...Following the previous methods [7, 64], we evaluate scene understanding on DriveLM [54]. |
| Dataset Splits | Yes | The nuScenes contains 1,000 scenes of approximately 20 seconds each captured by a 32-beam LiDAR and six cameras providing 360-degree field of view. Specifically, the dataset provides 28,130 (train), 6,019 (val), and 193,082 (unannotated) samples. |
| Hardware Specification | Yes | During fine-tuning (12 epochs on 8 NVIDIA RTX A6000), we use 1 × 10⁻⁴ learning rate and batch size of 16. |
| Software Dependencies | No | The paper mentions initializing the model with 'Qwen2-VL-2B [63]' and using 'MoVQGAN [92]' for the visual codebook. However, it does not specify version numbers for general software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries, which are essential for full reproducibility. |
| Experiment Setup | Yes | We initialize our model with Qwen2-VL-2B [63] and pre-train it for 32 epochs to enable visual generation while preserving semantic understanding. During fine-tuning (12 epochs on 8 NVIDIA RTX A6000), we use 1 × 10⁻⁴ learning rate and batch size of 16. |
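
The reported setup and the missing software versions can be captured together in a small run manifest. The sketch below is illustrative only and is not taken from the FSDrive repository: the names `ReportedSetup`, `installed_versions`, and `build_manifest` are hypothetical, the default package list (`torch`, `transformers`) is an assumption because the paper does not state which framework versions it used, and the learning rate is read as 1 x 10^-4.

```python
import platform
from dataclasses import asdict, dataclass
from importlib import metadata


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Hardware Specification and Experiment Setup rows."""

    base_model: str = "Qwen2-VL-2B"
    pretrain_epochs: int = 32
    finetune_epochs: int = 12
    # Fine-tuning learning rate, read as 1 x 10^-4 (the exponent sign is an assumption).
    learning_rate: float = 1e-4
    batch_size: int = 16
    num_gpus: int = 8
    gpu_model: str = "NVIDIA RTX A6000"


def installed_versions(packages):
    """Return {name: version or None} for the current Python environment."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


def build_manifest(packages=("torch", "transformers")):
    """Combine the reported hyperparameters with locally detected versions."""
    # The default package list is illustrative, not the paper's dependency list.
    return {
        "setup": asdict(ReportedSetup()),
        "software": installed_versions(packages),
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_manifest(), indent=2))
```

Storing such a manifest next to each checkpoint would record the version information that the Software Dependencies row marks as absent, without changing how training itself is run.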