Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Imagined Autocurricula

Authors: Ahmet Hamdi Güzel, Matthew T. Jackson, Jarek Liesen, Tim Rocktäschel, Jakob Foerster, Ilija Bogunovic, Jack Parker-Holder

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our contribution lies not in advancing world model architectures, but in demonstrating how UED principles can effectively guide agent training within learned world models from offline mixed dataset. Our approach IMAC uses Prioritized Level Replay (PLR, [24]) as a UED algorithm, which we show provides a natural complement to the learned world model: the world model generates diverse potential training trajectories or "imagined environments," while PLR strategically selects subsequent training tasks from these imagined rollouts. Figure 1 illustrates the overall architecture of our approach. 4 Experiments. 4.1 Procgen Benchmark: For comprehensive evaluation of our approach, we use a challenging subset of the Procgen Benchmark [38], a collection of procedurally generated environments designed specifically to test generalization in reinforcement learning. Table 1: Generalization results: All values represent the mean return over three random seeds ± standard deviation when transferring agents to held out levels on the Procgen benchmark. Table 2: World model ablation: Performance evaluation on the Procgen benchmark; all values represent mean return over three random seeds ± standard deviation. The results in Table 1 demonstrate that IMAC method consistently outperforms state-of-the-art offline RL algorithms across all evaluated Procgen environments.
Researcher Affiliation	Academia	Ahmet H. Güzel University College London AI Centre Matthew T. Jackson University of Oxford Jarek L. Liesen University of Oxford Tim Rocktäschel University College London AI Centre Jakob N. Foerster University of Oxford Ilija Bogunovic University College London AI Centre Jack Parker-Holder University College London AI Centre Corresponding author (EMAIL).
Pseudocode	Yes	Algorithm 1: IMAC (Sequential Offline Training) Procedure training_loop(D) Algorithm 2: IMAC: Actor-Critic Training with PLR (Prioritized Level Replay) Procedure train_actor_critic_PLR(D, B)
Open Source Code	No	5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The pipeline uses already open-sourced codebases, and the remainder of the code will be released on GitHub shortly after this submission.
Open Datasets	Yes	4.1 Procgen Benchmark For comprehensive evaluation of our approach, we use a challenging subset of the Procgen Benchmark [38], a collection of procedurally generated environments designed specifically to test generalization in reinforcement learning.
Dataset Splits	Yes	We constructed our offline dataset by collecting trajectories from 200 procedurally generated levels for each of the five Procgen environments (CoinRun, Ninja, Jumper, Maze, and CaveFlyer), totaling 1 million environment steps per game. We explicitly constrained data collection to levels 0-199 for each environment, ensuring complete isolation from the test levels (200+) used for evaluation, thereby preventing any potential data leakage and maintaining a strict train-test split that properly assesses generalization to truly unseen procedurally generated scenarios.
Hardware Specification	Yes	After data collection, we trained the world model and reward/termination predictors (requiring approximately 10 hours on an NVIDIA RTX 4090), then froze these components for generating imagined rollouts during agent training. Each agent seed’s training required approximately 4 days on a single RTX 4090, with experiments conducted across 3 random seeds for 5 million total steps each, resulting in 180 GPU days of computation. Table 6: Detailed computational time analysis for IMAC training pipeline. Measurements were obtained using an NVIDIA RTX 4090 GPU with hyperparameters detailed in previous sections.
Software Dependencies	No	The paper does not provide specific software names with version numbers for its dependencies. While it mentions architectures and frameworks (e.g., U-Net, CNN, LSTM, AdamW optimizer), it does not specify versions like 'PyTorch 1.x' or 'TensorFlow 2.y'.
Experiment Setup	Yes	Details of model architecture, algorithm, training hyperparameters used, and dataset details for this work are given in Appendices A, B, C, and D. Please refer to Table 3 below for hyperparameter values. Table 3: Architecture Parameters for IMAC. Hyperparameter Value Table 4: Hyperparameters for IMAC. Hyperparameter Value
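
The train-test split quoted in the Dataset Splits row (levels 0-199 for data collection, levels 200 and above for evaluation, in each of the five Procgen environments) can be illustrated with a small sketch. This is not the authors' code: the names `LevelSplit`, `make_splits`, and `check_no_leakage` are hypothetical, the lowercase environment identifiers are an assumption, and only the level ranges, the environment list, and the 1 million steps per game come from the quoted text.

```python
from dataclasses import dataclass

# Values quoted in the Dataset Splits row; the identifier spelling is assumed.
ENVIRONMENTS = ["coinrun", "ninja", "jumper", "maze", "caveflyer"]
NUM_TRAIN_LEVELS = 200        # data collected on levels 0-199
STEPS_PER_ENV = 1_000_000     # offline environment steps per game


@dataclass(frozen=True)
class LevelSplit:
    """Hypothetical container describing which level IDs belong to which split."""
    env: str
    train_levels: range
    first_test_level: int

    def is_train(self, level_id: int) -> bool:
        return level_id in self.train_levels

    def is_test(self, level_id: int) -> bool:
        return level_id >= self.first_test_level


def make_splits(envs=ENVIRONMENTS, num_train=NUM_TRAIN_LEVELS):
    """Build one split per environment: train on [0, num_train), test on the rest."""
    return {env: LevelSplit(env, range(0, num_train), num_train) for env in envs}


def check_no_leakage(split: LevelSplit, evaluation_levels) -> None:
    """Raise if any level used for evaluation was also used for data collection."""
    leaked = sorted(lvl for lvl in evaluation_levels if split.is_train(lvl))
    if leaked:
        raise ValueError(f"{split.env}: evaluation levels overlap training: {leaked}")


if __name__ == "__main__":
    splits = make_splits()
    for env, split in splits.items():
        check_no_leakage(split, evaluation_levels=[200, 201, 500])
    total_steps = STEPS_PER_ENV * len(splits)
    print(f"{len(splits)} environments, {total_steps:,} offline steps in total")
```

A check like `check_no_leakage` is cheap to run before every evaluation and makes the isolation claim verifiable rather than a convention.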