Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling Laws for Pre-training Agents and World Models

Authors: Tim Pearce, Tabish Rashid, David Bignell, Raluca Georgescu, Sam Devlin, Katja Hofmann

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Under this setting, we train families of transformers on next-token prediction tasks using architectures popular in both world modeling and BC tasks. This leads to several contributions, summarized in Figure 1. ... Section 4 presents our main results in world modeling and BC.
Researcher Affiliation | Industry | Microsoft Research (SD now at Meta). Correspondence to: Tim Pearce <.>
Pseudocode | No | The paper describes methodologies in text and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | All transformers are trained with a variant of nanoGPT (Karpathy, 2022) using PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). ... Shakespeare character dataset from: https://github.com/karpathy/nanoGPT ... We trained a set of five VQVAEs using the implementation from https://github.com/nadavbh12/VQ-VAE. The paper mentions using third-party open-source tools and datasets, but does not provide specific access information or a clear statement about releasing its own implementation code for the described methodology.
Open Datasets | Yes | Our work primarily focuses on a dataset of human behavior collected in the video game Bleeding Edge. ... As a secondary dataset we use RT-1 (Brohan et al., 2022), comprising 14 days of humans operating a robotic arm on a range of manipulation tasks such as pick banana from white bowl. ... Shakespeare character dataset from: https://github.com/karpathy/nanoGPT ... We used the BookCorpus dataset (Zhu et al., 2015)
Dataset Splits | Yes | The 7 Maps dataset comprised 60,986 matches, yielding 530,713 individual player trajectories (each around 9 minutes), totaling 27.89 TiB on disk. ... This was then divided into training / validation / test sets by dividing the matches with an 80:10:10 split. Our filtered Skygarden dataset used the same 80:10:10 split and 10Hz downsampling, but focused on just one map, yielding 71,940 individual player trajectories, or 355.5M frames (around 1.12 years of game play).
Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU models, CPU types) used for running its experiments, only discussing computational budget (FLOPs).
Software Dependencies | No | All transformers are trained with a variant of nanoGPT (Karpathy, 2022) using PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). ... In our implementation, we use SciPy's curve_fit function. The paper mentions software tools like nanoGPT, PyTorch Lightning, and SciPy, but does not specify their exact version numbers required for reproducibility.
Experiment Setup | Yes | Table 3. Hyperparameters for WM-Token with dz =256 tokens per image observation. ... Table 4. Hyperparameters for WM-Token with dz =540 tokens per image observation. ... Table 5. Hyperparameters for BC-Token with dz =540 tokens per image observation. ... Table 6. Hyperparameters for BC-CNN. ... We follow the approach of using a constant learning rate per model, so each requires only one training run. We aim to train models until they have passed their compute-efficient FLOPs budget. We only modify the parameters of the transformer, following the configurations documented in Appendix A.
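The match-level 80:10:10 split described in the Dataset Splits row (all trajectories from a match land in the same partition) can be sketched as follows. This is a minimal illustration, not the authors' code; the match IDs, seed, and function name are hypothetical.

```python
import random

def split_matches(match_ids, seed=0, fracs=(0.8, 0.1, 0.1)):
    """Shuffle match IDs deterministically, then cut 80:10:10.

    Splitting at the match level (hypothetical helper, not from the paper)
    keeps every trajectory from a given match in a single partition.
    """
    ids = sorted(match_ids)          # deterministic base order before shuffling
    random.Random(seed).shuffle(ids)
    n_train = int(fracs[0] * len(ids))
    n_val = int(fracs[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_matches(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling match IDs rather than individual trajectories avoids leakage between splits, since trajectories from the same match are highly correlated.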
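The Software Dependencies row notes that the paper fits its scaling curves with SciPy's curve_fit. A hedged sketch of that kind of fit, assuming a saturating power law L(C) = a * C^(-b) + c on synthetic (compute, loss) points; all constants are illustrative, not values from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    """Loss as a saturating power law of training compute C (FLOPs)."""
    return a * compute ** (-b) + c

# Synthetic, noise-free data over 1e15..1e19 FLOPs (illustrative only).
compute = np.logspace(15, 19, num=8)
loss = power_law(compute, 50.0, 0.12, 1.5)

# Recover the parameters from the data; p0 gives a reasonable starting point.
params, _ = curve_fit(power_law, compute, loss, p0=[10.0, 0.1, 1.0], maxfev=5000)
a_hat, b_hat, c_hat = params
print(f"a={a_hat:.2f}, b={b_hat:.3f}, c={c_hat:.2f}")
```

With noisy real measurements, the fitted exponent b and offset c are what a scaling-law analysis typically extrapolates from.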