SimGen: Simulator-conditioned Driving Scene Generation
Authors: Yunsong Zhou, Michael Simon, Zhenghao (Mark) Peng, Sicheng Mo, Hongzi Zhu, Minyi Guo, Bolei Zhou
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We further demonstrate the improvements brought by SimGen for synthetic data augmentation on the BEV detection and segmentation task and showcase its capability in safety-critical data generation. |
| Researcher Affiliation | Academia | 1 University of California, Los Angeles 2 Shanghai Jiao Tong University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our codes and data are available in https://github.com/metadriverse/SimGen, and we show full implementation details in Appendix C. |
| Open Datasets | Yes | A driving video dataset DIVA is collected to enhance the generative diversity of SimGen, which contains over 147.5 hours of real-world driving videos from 73 locations worldwide and simulated driving data from the MetaDrive simulator. |
| Dataset Splits | Yes | The nuScenes dataset [6] is a public driving dataset that includes 1000 scenes from Boston and Singapore for diverse driving tasks [87, 42, 41]. Each scene comprises a 20-second video, approximately 40 frames. It provides 700 training scenes, 150 validation scenes, and 150 test scenes. |
| Hardware Specification | Yes | The default GPUs in most of our experiments are NVIDIA Tesla A6000 devices unless otherwise specified. |
| Software Dependencies | Yes | Concretely, we utilize Stable Diffusion 2.1 (SD-2.1) [60], a large-scale latent diffusion model for text-to-image generation. It is implemented as a denoising UNet, denoted by ϵ_θ, with multiple stacked convolutional and attention blocks, which learns to synthesize images by denoising latent noise. |
| Experiment Setup | Yes | It is trained on 4.5M text-depth-segmentation pairs of DIVA-Real and nuScenes. We train the model for 30K iterations on 8 GPUs with a batch size of 96 with AdamW [43]. We linearly warm up the learning rate for 10³ steps in the beginning, then keep it constant at 1 × 10⁻⁵. |
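
The Software Dependencies row describes SD-2.1's denoising UNet ϵ_θ, which is trained to predict the noise injected into a latent. The sketch below illustrates that standard epsilon-prediction objective; the function name, calling convention eps_theta(z_t, t), and noise schedule are illustrative assumptions, not SimGen's actual code.

```python
import torch
import torch.nn.functional as F

def denoising_loss(eps_theta, z0, t, alphas_cumprod):
    """Standard latent-diffusion epsilon-prediction loss (illustrative sketch).

    eps_theta:      denoising UNet, called as eps_theta(z_t, t) (assumed API)
    z0:             clean image latents, shape (B, C, H, W)
    t:              integer timesteps, shape (B,)
    alphas_cumprod: cumulative noise-schedule products, shape (T,)
    """
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # Forward diffusion: corrupt the clean latent with scheduled Gaussian noise.
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    # The UNet is trained to recover the injected noise from the noisy latent.
    return F.mse_loss(eps_theta(z_t, t), noise)
```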
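
The Experiment Setup row quotes a concrete recipe: AdamW, a linear warmup over 10³ steps, then a constant learning rate of 1 × 10⁻⁵ for 30K iterations. A minimal PyTorch sketch of that schedule follows; the placeholder model and the bare training loop are assumptions for illustration only.

```python
import torch

model = torch.nn.Conv2d(4, 4, 3, padding=1)  # placeholder for the SimGen UNet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # constant LR from the paper

WARMUP_STEPS = 1_000   # "linearly warm up the learning rate for 10^3 steps"
TOTAL_STEPS = 30_000   # "30K iterations"

# Ramp the LR multiplier linearly from ~0 to 1 over the warmup window, then hold at 1.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS)
)

for step in range(TOTAL_STEPS):
    # ... forward pass and loss.backward() on a batch of 96 would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```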