Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SimGen: Simulator-conditioned Driving Scene Generation
Authors: Yunsong Zhou, Michael Simon, Zhenghao (Mark) Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation task and showcase its capability in safety-critical data generation. |
| Researcher Affiliation | Academia | ¹University of California, Los Angeles; ²Shanghai Jiao Tong University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes and data are available in https://github.com/metadriverse/SimGen, and we show full implementation details in Appendix C. |
| Open Datasets | Yes | A driving video dataset DIVA is collected to enhance the generative diversity of SimGen, which contains over 147.5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator. |
| Dataset Splits | Yes | The nuScenes dataset [6] is a public driving dataset that includes 1000 scenes from Boston and Singapore for diverse driving tasks [87, 42, 41]. Each scene comprises a 20-second video, approximately 40 frames. It provides 700 training scenes, 150 validation scenes, and 150 test scenes. |
| Hardware Specification | Yes | The default GPUs in most of our experiments are NVIDIA Tesla A6000 devices unless otherwise specified. |
| Software Dependencies | Yes | Concretely, we utilize Stable Diffusion 2.1 (SD-2.1) [60], a large-scale latent diffusion model for text-to-image generation. It is implemented as a denoising UNet, denoted by ϵ_θ, with multiple stacked convolutional and attention blocks, which learns to synthesize images by denoising latent noise. |
| Experiment Setup | Yes | It is trained on 4.5M text-depth-segmentation pairs of DIVA-Real and nuScenes. We train the model for 30K iterations on 8 GPUs with a batch size of 96 with AdamW [43]. We linearly warm up the learning rate for 10³ steps in the beginning, then keep it constant at 1 × 10⁻⁵. |
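
The learning-rate schedule quoted in the Experiment Setup row (linear warm-up, then a constant rate) can be illustrated with a short sketch. This is not the authors' training code. It assumes that the warm-up lasts 10^3 = 1000 steps (the superscript reading of the quoted value), that the constant rate is 1e-5, and that the warm-up ramps linearly from `base_lr / warmup_steps` up to `base_lr`. The function name `learning_rate` and its parameters are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only; names and the exact warm-up start value are
# assumptions, not taken from the paper or its repository.

BASE_LR = 1e-5        # constant learning rate after warm-up (from the quoted setup)
WARMUP_STEPS = 1000   # assumed reading of the quoted warm-up length (10^3)
TOTAL_STEPS = 30_000  # "30K iterations" from the quoted setup


def learning_rate(step: int,
                  base_lr: float = BASE_LR,
                  warmup_steps: int = WARMUP_STEPS) -> float:
    """Return the learning rate at a 0-indexed training step.

    During warm-up the rate grows linearly and reaches base_lr on the
    last warm-up step; afterwards it stays constant at base_lr.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


if __name__ == "__main__":
    # Print the rate at a few representative steps.
    for s in (0, 499, 999, 1000, TOTAL_STEPS - 1):
        print(s, learning_rate(s))
```

In a PyTorch setup, the same shape could be passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor on an `AdamW` optimizer whose base rate is 1e-5. Whether the authors implemented their schedule that way is not stated in the quoted text.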