Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Overcoming Challenges of Long-Horizon Prediction in Driving World Models

Authors: Arian Mousakhan, Sudhanshu Mittal, Silvio Galesso, Karim Farid, Thomas Brox

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental We show that our model yields state-of-the-art performance, despite having only 469M parameters and being trained on 280h of video data. It particularly stands out in difficult scenarios like turning maneuvers and urban traffic. We test whether discrete token models possibly have advantages over continuous models based on flow matching. To this end, we set up a hybrid tokenizer that is compatible with both approaches and allows for a side-by-side comparison. Our study concludes in favor of the continuous autoregressive model, which is less brittle on individual design choices and more powerful than the model built on discrete tokens.
Researcher Affiliation Academia University of Freiburg, Germany EMAIL
Pseudocode No The paper describes its methodology using textual descriptions and figures (e.g., Figure 3 illustrating the Image Tokenizer and World Model architectures). It does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes Project page with code, model checkpoints and visualization can be found here: https://lmb-freiburg.github.io/orbis.github.io
Open Datasets Yes To train our world model, we use subsets of videos from the BDD100K [84] and OpenDV [82] datasets. ... To train the tokenizer we additionally select images from Honda HAD [35], Honda HDD [63], ONCE [54], nuScenes [8], and nuPlan [9] to make the dataset diverse. ... All the datasets used in this work are publicly available.
Dataset Splits Yes To train our world model, we use subsets of videos from the BDD100K [84] and OpenDV [82] datasets. As shown in Table 1, we select a limited number of hours from each dataset and extract frames at 10 Hz. In total, we use 280 hours of video data from a combined available total of 2747 hours. ... For BDD100K, we select the dayclear subset of the training set. ... For this benchmark, we use the validation set of nuPlan [9]... The total resulting samples are 5878. Due to the computational cost of generating videos for all approaches we evaluate on the first 800 samples. ... For this benchmark, we use the validation set of nuPlan [9]. ... We evaluate on 400 of the resulting 416 samples. ... We use 400 of the resulting 406 samples selected with these criteria.
Hardware Specification Yes All small-scale models for ablation studies are trained on only the BDD100K subset for one day on 32 A100 GPUs. The higher resolution model is trained for 10 epochs over 5 days on 72 A100 GPUs. ... The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS [34] at Jülich Supercomputing Centre (JSC).
Software Dependencies No The paper mentions various architectures (e.g., Transformer, CNN), models (e.g., DiT, STDiT, Swin Transformer), and optimizers (AdamW), but does not provide specific version numbers for software libraries, programming languages, or other dependencies (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup Yes Our higher resolution model operates at 512 × 288 and small-scale model at 256 × 256. Tokenizer compresses the image spatially by 16×. We train latent models with a context of 5 frames sampled at 5Hz. ... We train models with a context of 5 frames sampled at 5Hz, using the AdamW [51] optimizer with a learning rate of 5 × 10^-5. ... The higher resolution model is trained for 10 epochs over 5 days on 72 A100 GPUs. ... For the FM model, ... To improve generalization and frame generation quality, we drop all context frames 50% of the time. ... In order to sample the next frame, we use ODE sampler and take 30 steps. For the MGM model, ... we replace 10% of the frames and 10% of the overall tokens with a mask token. ... We use T = 2, T = 0.2 and λ = 0.5 for our experiments. ... For the final model, we train the quantized version of the image tokenizer with codebook size of 16384 for each codebook. The model training has three phases. ... Three phases in total comprise of 20 epochs of training.
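
Several figures quoted in the rows above imply derived quantities that the excerpts do not state directly. The following is a minimal sketch of that arithmetic, not the authors' code: the function names are hypothetical, the GPU-hour figure assumes every GPU was busy for the full stated duration (so it is best read as an upper bound), and the latent grid assumes the factor of 16 applies equally to width and height.

```python
"""Back-of-the-envelope checks on figures quoted in the table above.

All function names are hypothetical; none of this is the authors' code.
"""
from __future__ import annotations


def latent_grid(width: int, height: int, downsample: int = 16) -> tuple[int, int]:
    """Latent grid size (columns, rows) after spatial compression.

    Assumes the same downsampling factor on both axes.
    """
    if width % downsample or height % downsample:
        raise ValueError("resolution must be divisible by the downsampling factor")
    return width // downsample, height // downsample


def gpu_hours(num_gpus: int, days: float) -> float:
    """GPU-hours, assuming all GPUs run for the whole period (upper bound)."""
    return num_gpus * days * 24


def data_fraction(used_hours: float, available_hours: float) -> float:
    """Share of the available video hours that was used for training."""
    return used_hours / available_hours


if __name__ == "__main__":
    print(latent_grid(512, 288))   # higher resolution model -> (32, 18)
    print(latent_grid(256, 256))   # small-scale model -> (16, 16)
    print(gpu_hours(32, 1))        # ablation runs -> 768
    print(gpu_hours(72, 5))        # higher resolution model -> 8640
    print(f"{data_fraction(280, 2747):.1%}")  # -> 10.2%
```

Under these assumptions the higher resolution model works on a 32 × 18 latent grid (576 positions per frame), its training run amounts to at most 8640 A100 GPU-hours, and the 280 training hours are roughly 10% of the 2747 hours available. The number of tokens per latent position is not derivable from the excerpts, so the sketch stops at grid size.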