Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Laws for Imitation Learning in Single-Agent Games
Authors: Jens Tuyls, Dhruv Madeka, Kari Torkkola, Dean Foster, Karthik R Narasimhan, Sham M. Kakade
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first demonstrate our findings on a variety of Atari games, and thereafter focus on the extremely challenging game of NetHack. In all games, we find that IL loss and mean return scale smoothly with the compute budget (FLOPs) and are strongly correlated, resulting in power laws (and variations of them) for training compute-optimal IL agents. Finally, we forecast and train several NetHack agents with IL and find our best agent outperforms the prior state-of-the-art by 1.7x in the offline setting. Our work both demonstrates the scaling behavior of imitation learning in a variety of single-agent games, as well as helps narrow the gap between the learner and the expert in NetHack, a game that remains elusively hard for current AI systems. |
| Researcher Affiliation | Collaboration | Princeton University, Amazon, Harvard University, University of Pennsylvania |
| Pseudocode | No | The paper includes mathematical equations and derivations (e.g., Equation 3, Equation 5, and the derivation in Appendix A) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/princeton-nlp/il-scaling-in-games |
| Open Datasets | Yes | We train Transformer-based agents on the NLD-AA dataset (Hambro et al., 2022b), varying both the width and depth (i.e. number of layers) of the model (see Appendix E). The NLD-AA dataset (Hambro et al., 2022b) is released under the NetHack General Public License and can be found at https://github.com/dungeonsdatasubmission/dungeonsdata-neurips2022. |
| Dataset Splits | Yes | In Figure 1 we plot the loss evaluated on a held-out set of about 100 (for Atari) and 10k (for NetHack) trajectories against the parameter count for each FLOP budget. |
| Hardware Specification | Yes | All training experiments were done on NVIDIA GPUs (a mix of GeForce RTX 3090, GeForce RTX 2080 Ti, RTX A5000, and RTX A6000) and took about 1-2 days depending on the game and FLOP budget. All NetHack BC experiments were run on NVIDIA H100 80GB GPUs. All Atari BC experiments were run on a mixture of NVIDIA A5000 and A6000 GPUs. The RL experiments were run on V100 32GB GPUs. |
| Software Dependencies | No | The paper mentions software tools and frameworks such as 'PPO (Schulman et al., 2017)', 'Adam (Kingma & Ba, 2014)', 'Stable Baselines3 (Raffin et al., 2021)', 'AdamW (Loshchilov & Hutter, 2019)', and 'RMSprop' but does not specify their version numbers. |
| Experiment Setup | Yes | Table 4: Hyperparameters for all experiments in Atari. We list the hyperparameters for all our BC experiments (a) as well as the ones used to train the PPO expert agent for each game (b). Table 5: Hyperparameters for all experiments in NetHack. We list the hyperparameters for all our BC experiments (a) as well as the ones for our RL experiments (b). |
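The paper's central claim is that IL loss scales as a power law in the compute budget. As a minimal sketch (not the authors' code; all values below are synthetic), fitting such a law L(C) = a * C^(-b) reduces to a linear fit in log-log space:

```python
import numpy as np

# Synthetic compute budgets (FLOPs) and losses drawn from a known
# power law with small multiplicative noise. The exponent 0.12 and
# prefactor 50.0 are illustrative, not values from the paper.
rng = np.random.default_rng(0)
flops = np.logspace(14, 18, 20)
true_a, true_b = 50.0, 0.12
loss = true_a * flops ** (-true_b) * np.exp(rng.normal(0.0, 0.01, flops.size))

# A power law is linear in log-log coordinates:
#   log L = log a - b * log C
# so an ordinary least-squares line recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(flops), np.log(loss), 1)
b_hat, a_hat = -slope, np.exp(intercept)
```

With the recovered `b_hat` and `a_hat`, one can extrapolate the fitted curve to larger FLOP budgets, which is the kind of forecasting the paper uses to size its compute-optimal agents.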