Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Generative Planning with 3D-Vision Language Pre-training for End-to-End Autonomous Driving
Authors: Tengpeng Li, Hanli Wang, Xianfei Li, Wenlong Liao, Tao He, Pai Peng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. ... Substantial experiments are conducted on the complex public nuScenes (Caesar et al. 2020) dataset... The ablation study in Table 3 systematically investigates the contributions of the key components of GPVL on the nuScenes dataset. |
| Researcher Affiliation | Collaboration | 1. College of Electronic and Information Engineering, Tongji University, Shanghai, China; 2. School of Computer Science and Technology, Tongji University, Shanghai, China; 3. COWAROBOT, China; 4. School of Electronic Engineering, University of South China, Hunan, China. EMAIL, pengpai EMAIL |
| Pseudocode | No | The paper describes methods and formulations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using pre-trained models like BEVformer, BLIP, and BERT, but does not provide any statement or link for the open-source code of their proposed GPVL methodology. |
| Open Datasets | Yes | Substantial experiments are conducted on the complex public nuScenes (Caesar et al. 2020) dataset, which comprises 1,000 traffic scenarios, and the duration of each video is around 20 seconds. This dataset offers over 1.4 million 3D bounding boxes across 23 different object categories. |
| Dataset Splits | Yes | Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. ... We train and test the models on datasets constructed from two different urban environments (i.e., Boston and Singapore). Specifically, two groups of experiments are introduced: (1) training on Boston and testing on Singapore, (2) training on Singapore and testing on Boston. ... In the nuScenes dataset, 87.7% training and 88.2% validation samples consist of simple go straight scenes. |
| Hardware Specification | Yes | The proposed model is trained on the PyTorch framework with 8 NVIDIA RTX A6000 cards. |
| Software Dependencies | No | The paper mentions using PyTorch framework, BERT structure, and Adam W optimizer, but no specific version numbers are provided for these software components. |
| Experiment Setup | Yes | The proposed model aims to predict the trajectory for the future 3 seconds. The input image size is 1280×720. GPVL utilizes ResNet50 (He et al. 2016) to extract the multi-view image features. The numbers of BEV queries, bounding boxes and map points are 200×200, 200 and 100×20, respectively. The feature dimension and hidden size are 768 and 512, respectively. The model utilizes the AdamW (Loshchilov and Hutter 2017) optimizer and weight decay 0.01 in the training process. The learning rates in three training stages are 2×10⁻⁴, 1×10⁻⁴ and 5×10⁻⁶, respectively. The BERT (Devlin et al. 2019) structure is used by 3D-vision language pre-training and cross-modal language model. As for testing, the size of greedy search is set to 1 to generate the trajectory caption. |
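The Experiment Setup row describes a three-stage training schedule (AdamW, weight decay 0.01, stage-wise learning rates of 2×10⁻⁴, 1×10⁻⁴ and 5×10⁻⁶). Since no code is released, the following is only a minimal sketch of how that schedule could be expressed as a configuration; the stage labels and helper name are hypothetical, not from the paper.

```python
# Hypothetical reconstruction of the staged optimizer settings reported in
# the paper. Stage descriptions are illustrative placeholders.
STAGE_LRS = [2e-4, 1e-4, 5e-6]  # learning rates for training stages 1-3
OPTIMIZER_DEFAULTS = {"type": "AdamW", "weight_decay": 0.01}

def optimizer_config(stage: int) -> dict:
    """Return the optimizer settings for a given training stage (1-indexed)."""
    if not 1 <= stage <= len(STAGE_LRS):
        raise ValueError(f"stage must be in 1..{len(STAGE_LRS)}, got {stage}")
    return {**OPTIMIZER_DEFAULTS, "lr": STAGE_LRS[stage - 1]}
```

With a real training loop, `optimizer_config(1)` would feed the first-stage settings (lr=2e-4, weight_decay=0.01) into, e.g., `torch.optim.AdamW`.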