Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].

Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency

Authors: Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the synthetic data.
Researcher Affiliation | Collaboration | Xiangyu Guo1*, Zhanqian Wu2*, Kaixin Xiong2*, Ziyang Xu1, Lijun Zhou2, Gangwei Xu1, Shaoqing Xu2, Haiyang Sun2, Bing Wang2, Guang Chen2, Hangjun Ye2, Wenyu Liu1, Xinggang Wang1; 1Huazhong University of Science and Technology, 2Xiaomi EV
Pseudocode | No | The paper describes its methodology in Sections 3.1, 3.2, and 3.3, using descriptive text, mathematical formulations (e.g., Eq. 1-8), and architectural diagrams, but does not include a dedicated pseudocode or algorithm block.
Open Source Code | No | The datasets used in our experiments are publicly available, while the code is not yet released. Detailed experimental procedures are provided to ensure the verifiability of our findings.
Open Datasets | Yes | Training and evaluation are conducted on the nuScenes [4] dataset, which includes 1,000 urban driving scenes (700 train / 150 val / 150 test). [4] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621-11631, 2020.
Dataset Splits | Yes | Training and evaluation are conducted on the nuScenes [4] dataset, which includes 1,000 urban driving scenes (700 train / 150 val / 150 test).
Hardware Specification | Yes | Training is performed with PyTorch using 64 NVIDIA H20 GPUs and mixed-precision acceleration.
Software Dependencies | No | Training is performed with PyTorch using 64 NVIDIA H20 GPUs and mixed-precision acceleration.
Experiment Setup | Yes | A three-stage curriculum is adopted. (1) In the first stage, image-level generation is trained at 512×768 resolution to warm-start spatial representation learning. (2) The second stage focuses on video synthesis, where a two-phase training protocol is employed. Multi-resolution pretraining is first conducted by gradually increasing input resolution from 144p (144×256) to 900p (900×1600), paired with clip lengths ranging from 128 frames at lower resolutions to 6 frames at higher ones. This is followed by adapter-based fine-tuning at a fixed resolution of 360p (360×640) and 16 frames, where lightweight spatiotemporal adapters are inserted into DiT blocks. (3) The third stage performs joint training of video and LiDAR generation with shared conditioning signals, enabling cross-modal temporal alignment. For inference, video frames are generated at 900p with the first frame observed as input, consistent with prior work [8]. Training is performed with PyTorch using 64 NVIDIA H20 GPUs and mixed-precision acceleration. Stages 1, 2, and 3 are trained for 300, 800, and 200 epochs respectively. The optimizer is AdamW with a weight decay of 0.01. A cosine annealing learning rate schedule is adopted with linear warm-up over the first 10% of steps. Learning rates are set to 2×10^-4 for the image and video stages, and 1×10^-5 for the joint stage. A global batch size of 1024 is used, distributed evenly across GPUs.
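
The optimization settings quoted in the Experiment Setup row can be illustrated with a small standard-library Python sketch. Since the paper's code is not released, this is not the authors' implementation: the function names `per_gpu_batch_size` and `lr_at_step` are hypothetical, and the sketch assumes that warm-up starts from a learning rate of 0 and that the cosine phase decays to 0 at the final step, neither of which is stated in the quoted text.

```python
import math

# Values quoted in the Experiment Setup row.
NUM_GPUS = 64
GLOBAL_BATCH_SIZE = 1024
WARMUP_FRACTION = 0.10
BASE_LR = {"image": 2e-4, "video": 2e-4, "joint": 1e-5}


def per_gpu_batch_size(global_batch=GLOBAL_BATCH_SIZE, num_gpus=NUM_GPUS):
    """Split the global batch evenly across GPUs (hypothetical helper)."""
    if global_batch % num_gpus != 0:
        raise ValueError("global batch must divide evenly across GPUs")
    return global_batch // num_gpus


def lr_at_step(step, total_steps, base_lr, warmup_fraction=WARMUP_FRACTION):
    """Linear warm-up followed by cosine annealing (hypothetical helper).

    Assumptions not taken from the source: warm-up starts at 0 and the
    cosine phase decays to 0 at `total_steps`.
    """
    warmup_steps = max(1, int(round(warmup_fraction * total_steps)))
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warm-up steps.
        return base_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("samples per GPU per step:", per_gpu_batch_size())
    for step in (0, 50, 100, 550, 1000):
        print(step, lr_at_step(step, 1000, BASE_LR["image"]))
```

Under these assumptions, each of the 64 GPUs processes 16 samples per step, and the learning rate peaks at the stage's base value once the first 10% of steps have elapsed. In a PyTorch training loop, a function of this shape, divided by the base rate, could serve as the multiplier passed to `torch.optim.lr_scheduler.LambdaLR`.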