Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization

Authors: Frank Röder, Jan Benad, Manfred Eppe, Pradeep Banerjee

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In Sections 6.1 and 6.2, we evaluate DALI's ability to generalize in a zero-shot manner across unseen context variations. In Section 6.3, we show that the dynamics-aligned context encoder learns a structured latent representation, where perturbations to individual dimensions produce physically plausible counterfactuals (e.g., shorter ball swings for higher imagined gravity). We evaluate DALI's zero-shot generalization on contextualized DMC Ball-in-Cup and Walker Walk tasks from the CARL benchmark [Benjamins et al., 2023]. We report IQM and PoI across 10 seeds in Figure 1, comparing DALI-S, DALI-S-χ, Dreamer-DR, cRSSM-S, and cRSSM-D for Featurized and Pixel observations on the Ball-in-Cup and Walker Walk tasks.
Researcher Affiliation	Academia	Frank Röder, Jan Benad, Manfred Eppe, Pradeep Kr. Banerjee. Institute for Data Science Foundations, Blohmstraße 15, 21079 Hamburg, Germany. JB and ME gratefully acknowledge funding by the German Research Foundation DFG through the MoReSpace (402776968) project.
Pseudocode	Yes	For detailed pseudocode, see Algorithms 1 and 2 in Appendix B for Shallow Integration, and Algorithms 3 and 4 for Deep Integration.
Open Source Code	Yes	Our code is available at https://github.com/frankroeder/DALI.
Open Datasets	Yes	We evaluate DALI's zero-shot generalization on contextualized DMC Ball-in-Cup and Walker Walk tasks from the CARL benchmark [Benjamins et al., 2023]. The DeepMind Control Suite (DMC) [Tassa et al., 2018].
Dataset Splits	Yes	To formalize zero-shot generalization, we define two distributions over contexts: the training distribution p_train(c), from which contexts are sampled during training, and the evaluation distribution p_eval(c), representing unseen test contexts. We assess performance under three generalization regimes [Kirk et al., 2023]: Interpolation (contexts within the training range), Extrapolation (OOD contexts beyond the training range), and Mixed (one context OOD, one within training range). For Ball-in-Cup, the context parameters are gravity (training: [4.9, 14.7], evaluation: [0.98, 4.9) ∪ (14.7, 19.6], default: 9.81) and string length (training: [0.15, 0.45], evaluation: [0.03, 0.15) ∪ (0.45, 0.6], default: 0.3).
Hardware Specification	Yes	Training and evaluation of the baselines and our DALI approaches were conducted on NVIDIA A100 GPUs with 80GB of VRAM and Intel Xeon Platinum 8352V CPUs. Typically, our setup provides access to 2 GPUs on average, with up to 4 GPUs available in the best-case scenario.
Software Dependencies	No	We adopt the small DreamerV3 variant with hyperparameters from Hafner et al. [2025], following the setup of Prasanna et al. [2024] to ensure fair and reproducible comparison with their cRSSM-S/D baselines. DALI adds only about 4% parameter overhead (e.g., Dreamer-DR: 15.73M vs. DALI-S: 16.45M) while consistently improving performance with minimal additional complexity. (The text mentions software/frameworks like DreamerV3 but does not provide specific version numbers for dependencies like PyTorch, TensorFlow, etc.)
Experiment Setup	Yes	Setup. We train our methods using the small variant of DreamerV3 [Hafner et al., 2025], with a transformer-based context encoder [Vaswani et al., 2017], for 200K timesteps (Ball-in-Cup) or 500K timesteps (Walker) across 10 random seeds, following the setup of Prasanna et al. [2024]. Hyperparameters and architectural details are provided in Appendix C. The context encoder g_ϕ employs a standard transformer encoder block [Vaswani et al., 2017] to process a sequence of K transitions, (o_{t-K:t}, a_{t-K:t-1}), and produce the context representation z_t ∈ ℝ^8.
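
The Ball-in-Cup context ranges quoted under Dataset Splits can be illustrated with a minimal Python sketch that labels a (gravity, string length) pair with its generalization regime. This is not the authors' code: the names `ContextRange`, `BALL_IN_CUP`, and `regime` are hypothetical, and the sketch assumes that Extrapolation means both context parameters are out of distribution, which the quoted text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextRange:
    """Closed training interval nested inside a wider closed evaluation interval."""

    train_low: float
    train_high: float
    eval_low: float
    eval_high: float

    def is_training(self, value: float) -> bool:
        # Training range is closed on both ends, e.g. gravity in [4.9, 14.7].
        return self.train_low <= value <= self.train_high

    def is_ood(self, value: float) -> bool:
        # Evaluation range excludes the training interval,
        # e.g. gravity in [0.98, 4.9) or (14.7, 19.6].
        below = self.eval_low <= value < self.train_low
        above = self.train_high < value <= self.eval_high
        return below or above


# Ranges copied from the Dataset Splits row; the dictionary layout is illustrative.
BALL_IN_CUP = {
    "gravity": ContextRange(4.9, 14.7, 0.98, 19.6),
    "string_length": ContextRange(0.15, 0.45, 0.03, 0.6),
}


def regime(gravity: float, string_length: float) -> str:
    """Label a context pair as interpolation, mixed, or extrapolation.

    Assumption: extrapolation requires both parameters to be OOD.
    """
    values = {"gravity": gravity, "string_length": string_length}
    ood_count = 0
    for name, value in values.items():
        context_range = BALL_IN_CUP[name]
        if context_range.is_ood(value):
            ood_count += 1
        elif not context_range.is_training(value):
            raise ValueError(f"{name}={value} lies outside the evaluated range")
    return ("interpolation", "mixed", "extrapolation")[ood_count]
```

For example, `regime(9.81, 0.3)` uses the default context values, which both fall inside the training intervals, so the pair is labeled as interpolation.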