SimGen: Simulator-conditioned Driving Scene Generation

Authors: Yunsong Zhou, Michael Simon, Zhenghao (Mark) Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation task and showcase its capability in safety-critical data generation."
Researcher Affiliation | Academia | "(1) University of California, Los Angeles; (2) Shanghai Jiao Tong University"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our codes and data are available in https://github.com/metadriverse/SimGen, and we show full implementation details in Appendix C."
Open Datasets | Yes | "A driving video dataset DIVA is collected to enhance the generative diversity of SimGen, which contains over 147.5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator."
Dataset Splits | Yes | "The nuScenes dataset [6] is a public driving dataset that includes 1000 scenes from Boston and Singapore for diverse driving tasks [87, 42, 41]. Each scene comprises a 20-second video, approximately 40 frames. It provides 700 training scenes, 150 validation scenes, and 150 test scenes."
Hardware Specification | Yes | "The default GPUs in most of our experiments are NVIDIA Tesla A6000 devices unless otherwise specified."
Software Dependencies | Yes | "Concretely, we utilize Stable Diffusion 2.1 (SD-2.1) [60], a large-scale latent diffusion model for text-to-image generation. It is implemented as a denoising UNet, denoted by ϵ_θ, with multiple stacked convolutional and attention blocks, which learns to synthesize images by denoising latent noise." (See the loading sketch after the table.)
Experiment Setup | Yes | "It is trained on 4.5M text-depth-segmentation pairs of DIVA-Real and nuScenes. We train the model for 30K iterations on 8 GPUs with a batch size of 96 with AdamW [43]. We linearly warm up the learning rate for 10^3 steps in the beginning, then keep it constant at 1 × 10^-5." (See the optimizer/scheduler sketch after the table.)
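
As context for the Software Dependencies row: the paper builds on the public SD-2.1 checkpoint, and the snippet below is a minimal sketch of loading it and accessing its denoising UNet ϵ_θ. The use of Hugging Face's diffusers library and the model id `stabilityai/stable-diffusion-2-1` are assumptions of this sketch; the paper does not state which framework the authors use.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the public Stable Diffusion 2.1 checkpoint (assumed access path;
# the paper only names SD-2.1, not how it is loaded).
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)

# The denoising UNet epsilon_theta: stacked convolutional and attention
# blocks that predict the noise to remove from a latent at each step.
unet = pipe.unet
print(sum(p.numel() for p in unet.parameters()))  # UNet parameter count
```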
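
For the Experiment Setup row, here is a minimal PyTorch sketch of the stated optimization recipe: AdamW, a linear warmup over 10^3 steps, then a constant learning rate of 1 × 10^-5 for 30K iterations. The `Linear` model stand-in and the dummy loss are hypothetical placeholders; the paper specifies only the hyperparameters.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical stand-in for the conditioned SD-2.1 UNet being trained.
model = torch.nn.Linear(8, 8)

base_lr = 1e-5        # constant rate after warmup, per the paper
warmup_steps = 1000   # "linearly warm up the learning rate for 10^3 steps"
total_steps = 30_000  # "30K iterations"

optimizer = AdamW(model.parameters(), lr=base_lr)

def lr_lambda(step: int) -> float:
    # Linear ramp from ~0 up to base_lr over warmup_steps, then constant.
    return min((step + 1) / warmup_steps, 1.0)

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    # Dummy loss; the paper uses an effective batch size of 96 across 8 GPUs.
    loss = model(torch.randn(96, 8)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```

A clamped `LambdaLR` ramp is one simple way to express "linear warmup, then constant" with no decay phase; the scheduler multiplies the base learning rate by 1.0 once warmup ends.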