Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Learning Transformer-based World Models with Contrastive Predictive Coding

Authors: Maxime Burchi, Radu Timofte

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search. We release our code at https://github.com/burchim/TWISTER. In this section, we describe our experiments on the commonly used Atari 100k benchmark. We compare TWISTER with SimPLe, DreamerV3 and recent Transformer model-based approaches in Table 2. We also perform several ablation studies on the principal components of TWISTER.
Researcher Affiliation	Academia	Maxime Burchi, Radu Timofte Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany EMAIL
Pseudocode	No	The paper describes the architecture and optimization process of the proposed Transformer-based world model with contrastive representations using equations and textual descriptions, but does not include a dedicated pseudocode or algorithm block.
Open Source Code	Yes	TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search. We release our code at https://github.com/burchim/TWISTER.
Open Datasets	Yes	TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark... The Atari 100k benchmark was proposed in Kaiser et al. (2020) to evaluate reinforcement learning agents on Atari games in low data regime.
Dataset Splits	No	The Atari 100k benchmark was proposed in Kaiser et al. (2020) to evaluate reinforcement learning agents on Atari games in low data regime. The benchmark includes 26 Atari games with a budget of 400k environment frames, amounting to 100k interactions between the agent and the environment using the default action repeat setting.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies	No	The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup	Yes	Table 9: TWISTER hyper-parameters. We apply the same hyper-parameters to all Atari games. General: Batch Size (B) = 16; Sequence Length (T) = 64; Optimizer = Adam (Kingma & Ba, 2014); Image Resolution = 64 × 64 (RGB); Training Step per Policy Step = 1; Environment Instances = 1. Transformer Network: Transformer Blocks (N) = 4; Number of Attention Heads = 8; Dropout Probability = 0.1; Attention Context Length = 8. World Model: Stochastic State Features = 32; Classes per Feature = 32; Dynamics Loss Scale (βdyn) = 0.5; Representation Loss Scale (βreg) = 0.1; AC-CPC Steps (K) = 10; Random Crop & Resize Scale = (0.25, 1.0); Random Crop & Resize Ratio = (0.75, 1.33); Learning Rate (α) = 10^-4; Adam Betas (β1, β2) = 0.9, 0.999; Adam Epsilon (ϵ) = 10^-8; Gradient Clipping = 1000. Actor Critic: Imagination Horizon (H) = 15; Return Discount (γ) = 0.997; Return Lambda (λ) = 0.95; Critic EMA Decay = 0.98; Return Normalization Momentum = 0.99; Actor Entropy Scale (η) = 3 × 10^-4; Learning Rate (α) = 3 × 10^-5; Adam Betas (β1, β2) = 0.9, 0.999; Adam Epsilon (ϵ) = 10^-5; Gradient Clipping = 100.
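
For anyone attempting a replication, the hyper-parameters quoted in the Experiment Setup row can be collected into a single configuration object. The following is a minimal sketch, not the layout used in the TWISTER repository: the class names, field names, and the `agent_interactions` helper are hypothetical, the exponents are read as negative powers of ten, and the action repeat of 4 is inferred from the quoted budget of 400k frames for 100k interactions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OptimizerConfig:
    """Adam settings; field names are illustrative, values are from Table 9."""
    learning_rate: float
    epsilon: float
    grad_clip: float
    betas: tuple = (0.9, 0.999)


@dataclass(frozen=True)
class TwisterConfig:
    """Hypothetical container for the quoted hyper-parameters."""
    # General
    batch_size: int = 16
    sequence_length: int = 64
    image_resolution: tuple = (64, 64)  # RGB frames
    train_steps_per_policy_step: int = 1
    env_instances: int = 1
    # Transformer network
    transformer_blocks: int = 4
    attention_heads: int = 8
    dropout: float = 0.1
    attention_context_length: int = 8
    # World model
    stochastic_features: int = 32
    classes_per_feature: int = 32
    dynamics_loss_scale: float = 0.5
    representation_loss_scale: float = 0.1
    ac_cpc_steps: int = 10
    crop_resize_scale: tuple = (0.25, 1.0)
    crop_resize_ratio: tuple = (0.75, 1.33)
    world_model_opt: OptimizerConfig = field(
        default_factory=lambda: OptimizerConfig(1e-4, 1e-8, 1000.0)
    )
    # Actor critic
    imagination_horizon: int = 15
    return_discount: float = 0.997
    return_lambda: float = 0.95
    critic_ema_decay: float = 0.98
    return_norm_momentum: float = 0.99
    actor_entropy_scale: float = 3e-4
    actor_critic_opt: OptimizerConfig = field(
        default_factory=lambda: OptimizerConfig(3e-5, 1e-5, 100.0)
    )


def agent_interactions(env_frames: int, action_repeat: int) -> int:
    """Number of agent decisions for a given frame budget."""
    return env_frames // action_repeat


# Assumption: action repeat of 4, implied by 400k frames / 100k interactions.
ACTION_REPEAT = 4

config = TwisterConfig()
interactions = agent_interactions(400_000, ACTION_REPEAT)
```

Keeping the world-model and actor-critic optimizers as separate objects mirrors the table, where the two groups share the Adam betas but differ in learning rate, epsilon, and gradient clipping.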