Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving
Authors: Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, Zhaoxiang Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the NAVSIM benchmark demonstrate that DriveDPO achieves a new state-of-the-art PDMS of 90.0. Furthermore, qualitative results across diverse challenging scenarios highlight DriveDPO's ability to produce safer and more reliable driving behaviors. We conduct comprehensive experiments on the NAVSIM benchmark and achieve a new state-of-the-art PDMS of 90.0, significantly advancing performance across multiple safety-critical metrics. |
| Researcher Affiliation | Collaboration | Shuyao Shang (1,2), Yuntao Chen (3), Yuqi Wang (1,2), Yingyan Li (1,2), Zhaoxiang Zhang (1,2). (1) NLPR, Institute of Automation, Chinese Academy of Sciences; (2) University of Chinese Academy of Sciences; (3) MiroMind. |
| Pseudocode | No | The paper describes the methodology using mathematical formulas and textual explanations, for example for L_DPO and p_unified, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code will be open-sourced upon paper acceptance. |
| Open Datasets | Yes | We evaluate the proposed framework on the NAVSIM benchmark [12] along with the Bench2Drive benchmark [13]. NAVSIM: The NAVSIM [12] benchmark combines real-world sensor data with a non-interactive simulation mechanism, which is built upon OpenScene [57], a reprocessed version of the nuPlan dataset [58]. Bench2Drive: Bench2Drive [13] is a closed-loop evaluation benchmark for end-to-end autonomous driving. |
| Dataset Splits | Yes | For each frame, the NAVSIM dataset provides eight high-resolution camera images and fused point cloud data sampled at 2 Hz. ... The final standardized training set, Navtrain, contains approximately 103,000 samples, and the test set, Navtest, contains around 12,000 samples. ... The official training set contains approximately 13,638 short clips, covering 44 categories of interactive scenarios, 23 weather conditions, 12 towns, and a full sensor suite. The evaluation set is organized into 220 short routes that assess various interaction capabilities under different towns and weather. |
| Hardware Specification | Yes | All experiments are conducted on 6 NVIDIA L20 GPUs, with a batch size of 16 per GPU. |
| Software Dependencies | No | The paper mentions software components like Transfuser [50] and the AdamW optimizer [59] but does not provide specific version numbers for any libraries or frameworks used in the implementation. |
| Experiment Setup | Yes | The size of the anchor vocabulary is set to N = 8192. We use fixed weights w1 = 0.1 and w2 = 1.0 for unified policy distillation. The number of frequency bands in the positional encoding is set to L = 10. The predefined safety threshold τ is set to 0.3. All experiments are conducted on 6 NVIDIA L20 GPUs, with a batch size of 16 per GPU. We use the AdamW optimizer [59] with a learning rate of 1e-4. The model is first trained for 30 epochs using unified policy distillation, followed by 10 epochs of fine-tuning with Safety DPO. We sample K = 1024 trajectories from the policy distribution for each DPO iteration and set β = 0.1. In DPO training, inspired by [11], we introduce an explicit KL regularization term to suppress distributional drift during training. Finally, similar to [35], we continue applying the KL loss from unified policy distillation during the DPO fine-tuning stage as an auxiliary loss. |
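
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class name `DriveDPOSetup` and every field name are hypothetical, and only the numeric values come from the quoted text. The derived global batch size assumes plain data-parallel training across the 6 GPUs with no gradient accumulation, which the quoted text does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriveDPOSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""

    anchor_vocab_size: int = 8192    # N, size of the anchor vocabulary
    distill_w1: float = 0.1          # w1, unified policy distillation weight
    distill_w2: float = 1.0          # w2, unified policy distillation weight
    pos_enc_bands: int = 10          # L, frequency bands in the positional encoding
    safety_threshold: float = 0.3    # tau, predefined safety threshold
    num_gpus: int = 6                # NVIDIA L20 GPUs
    batch_size_per_gpu: int = 16
    learning_rate: float = 1e-4      # AdamW learning rate
    distill_epochs: int = 30         # stage 1: unified policy distillation
    dpo_epochs: int = 10             # stage 2: Safety DPO fine-tuning
    dpo_samples: int = 1024          # K, trajectories sampled per DPO iteration
    dpo_beta: float = 0.1            # beta in the DPO objective

    @property
    def global_batch_size(self) -> int:
        # Assumption: data-parallel training, no gradient accumulation.
        return self.num_gpus * self.batch_size_per_gpu

    @property
    def total_epochs(self) -> int:
        return self.distill_epochs + self.dpo_epochs


setup = DriveDPOSetup()
print(setup.global_batch_size)  # 6 * 16 = 96
print(setup.total_epochs)       # 30 + 10 = 40
```

Settings that the quoted text leaves unspecified, such as weight decay, the learning-rate schedule, and the weight of the KL regularization term, are deliberately absent from this sketch.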