Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling Laws for Pre-training Agents and World Models

Authors: Tim Pearce, Tabish Rashid, David Bignell, Raluca Georgescu, Sam Devlin, Katja Hofmann

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Under this setting, we train families of transformers on next-token prediction tasks using architectures popular in both world modeling and BC tasks. This leads to several contributions, summarized in Figure 1. ... Section 4 presents our main results in world modeling and BC.
Researcher Affiliation | Industry | Microsoft Research (SD now at Meta). Correspondence to: Tim Pearce <.>
Pseudocode | No | The paper describes methodologies in text and mathematical formulations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | All transformers are trained with a variant of nanoGPT (Karpathy, 2022) using PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). ... Shakespeare character dataset from: https://github.com/karpathy/nanoGPT ... We trained a set of five VQVAEs using the implementation from https://github.com/nadavbh12/VQ-VAE. The paper mentions using third-party open-source tools and datasets, but does not provide specific access information or a clear statement about releasing its own implementation code for the described methodology.
Open Datasets | Yes | Our work primarily focuses on a dataset of human behavior collected in the video game Bleeding Edge. ... As a secondary dataset we use RT-1 (Brohan et al., 2022), comprising 14 days of humans operating a robotic arm on a range of manipulation tasks such as pick banana from white bowl. ... Shakespeare character dataset from: https://github.com/karpathy/nanoGPT ... We used the BookCorpus dataset (Zhu et al., 2015)
Dataset Splits | Yes | The 7 Maps dataset comprised 60,986 matches, yielding 530,713 individual player trajectories (each around 9 minutes), totaling 27.89 TiB on disk. ... This was then divided into training / validation / test sets by dividing the matches with an 80:10:10 split. Our filtered Skygarden dataset used the same 80:10:10 split and 10Hz downsampling, but focused on just one map, yielding 71,940 individual player trajectories, or 355.5M frames (around 1.12 years of game play).
Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU models, CPU types) used for running its experiments, only discussing computational budget (FLOPs).
Software Dependencies | No | All transformers are trained with a variant of nanoGPT (Karpathy, 2022) using PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019). ... In our implementation, we use SciPy's curve_fit function. The paper mentions software tools like nanoGPT, PyTorch Lightning, and SciPy, but does not specify their exact version numbers required for reproducibility.
Experiment Setup | Yes | Table 3. Hyperparameters for WM-Token with dz =256 tokens per image observation. ... Table 4. Hyperparameters for WM-Token with dz =540 tokens per image observation. ... Table 5. Hyperparameters for BC-Token with dz =540 tokens per image observation. ... Table 6. Hyperparameters for BC-CNN. ... We follow the approach of using a constant learning rate per model, so each requires only one training run. We aim to train models until they have passed their compute-efficient FLOPs budget. We only modify the parameters of the transformer, following the configurations documented in Appendix A.
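The match-level 80:10:10 split described in the Dataset Splits row (all trajectories from a match land in the same partition) can be sketched as follows. This is a minimal illustration, not the authors' code; the match IDs, seed, and function name are hypothetical.

```python
import random

def split_matches(match_ids, seed=0, fracs=(0.8, 0.1, 0.1)):
    """Shuffle match IDs deterministically, then cut 80:10:10.

    Splitting at the match level (hypothetical helper, not from the paper)
    keeps every trajectory from a given match in a single partition.
    """
    ids = sorted(match_ids)          # deterministic base order before shuffling
    random.Random(seed).shuffle(ids)
    n_train = int(fracs[0] * len(ids))
    n_val = int(fracs[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

train, val, test = split_matches(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Shuffling match IDs rather than individual trajectories avoids leakage between splits, since trajectories from the same match are highly correlated.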
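The Software Dependencies row notes that the paper fits its scaling curves with SciPy's curve_fit. A hedged sketch of that kind of fit, assuming a saturating power law L(C) = a * C^(-b) + c on synthetic (compute, loss) points; all constants are illustrative, not values from the paper:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    """Loss as a saturating power law of training compute C (FLOPs)."""
    return a * compute ** (-b) + c

# Synthetic, noise-free data over 1e15..1e19 FLOPs (illustrative only).
compute = np.logspace(15, 19, num=8)
loss = power_law(compute, 50.0, 0.12, 1.5)

# Recover the parameters from the data; p0 gives a reasonable starting point.
params, _ = curve_fit(power_law, compute, loss, p0=[10.0, 0.1, 1.0], maxfev=5000)
a_hat, b_hat, c_hat = params
print(f"a={a_hat:.2f}, b={b_hat:.3f}, c={c_hat:.2f}")
```

With noisy real measurements, the fitted exponent b and offset c are what a scaling-law analysis typically extrapolates from.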