Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Laws for Imitation Learning in Single-Agent Games
Authors: Jens Tuyls, Dhruv Madeka, Kari Torkkola, Dean Foster, Karthik R Narasimhan, Sham M. Kakade
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first demonstrate our findings on a variety of Atari games, and thereafter focus on the extremely challenging game of NetHack. In all games, we find that IL loss and mean return scale smoothly with the compute budget (FLOPs) and are strongly correlated, resulting in power laws (and variations of them) for training compute-optimal IL agents. Finally, we forecast and train several NetHack agents with IL and find our best agent outperforms the prior state-of-the-art by 1.7x in the offline setting. Our work both demonstrates the scaling behavior of imitation learning in a variety of single-agent games, as well as helps narrow the gap between the learner and the expert in NetHack, a game that remains elusively hard for current AI systems. |
| Researcher Affiliation | Collaboration | Princeton University, Amazon, Harvard University, University of Pennsylvania |
| Pseudocode | No | The paper includes mathematical equations and derivations (e.g., Equation 3, Equation 5, and the derivation in Appendix A) but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/princeton-nlp/il-scaling-in-games |
| Open Datasets | Yes | We train Transformer-based agents on the NLD-AA dataset (Hambro et al., 2022b), varying both the width and depth (i.e. number of layers) of the model (see Appendix E). The NLD-AA dataset (Hambro et al., 2022b) is released under the NetHack General Public License and can be found at https://github.com/dungeonsdatasubmission/dungeonsdata-neurips2022. |
| Dataset Splits | Yes | In Figure 1 we plot the loss evaluated on a held-out set of about 100 (for Atari) and 10k (for NetHack) trajectories against the parameter count for each FLOP budget. |
| Hardware Specification | Yes | All training experiments were done on NVIDIA GPUs (a mix of GeForce RTX 3090, GeForce RTX 2080 Ti, RTX A5000, and RTX A6000) and took about 1-2 days depending on the game and FLOP budget. All NetHack BC experiments were run on NVIDIA H100 80GB GPUs. All Atari BC experiments were run on a mixture of NVIDIA A5000 and A6000 GPUs. The RL experiments were run on V100 32GB GPUs. |
| Software Dependencies | No | The paper mentions software tools and frameworks such as 'PPO (Schulman et al., 2017)', 'Adam (Kingma & Ba, 2014)', 'Stable Baselines3 (Raffin et al., 2021)', 'AdamW (Loshchilov & Hutter, 2019)', and 'RMSprop' but does not specify their version numbers. |
| Experiment Setup | Yes | Table 4: Hyperparameters for all experiments in Atari. We list the hyperparameters for all our BC experiments (a) as well as the ones used to train the PPO expert agent for each game (b). Table 5: Hyperparameters for all experiments in NetHack. We list the hyperparameters for all our BC experiments (a) as well as the ones for our RL experiments (b). |
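The paper's central claim is that IL loss scales as a power law in the compute budget. As a minimal sketch (not the authors' code; all values below are synthetic), fitting such a law L(C) = a * C^(-b) reduces to a linear fit in log-log space:

```python
import numpy as np

# Synthetic compute budgets (FLOPs) and losses drawn from a known
# power law with small multiplicative noise. The exponent 0.12 and
# prefactor 50.0 are illustrative, not values from the paper.
rng = np.random.default_rng(0)
flops = np.logspace(14, 18, 20)
true_a, true_b = 50.0, 0.12
loss = true_a * flops ** (-true_b) * np.exp(rng.normal(0.0, 0.01, flops.size))

# A power law is linear in log-log coordinates:
#   log L = log a - b * log C
# so an ordinary least-squares line recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(flops), np.log(loss), 1)
b_hat, a_hat = -slope, np.exp(intercept)
```

With the recovered `b_hat` and `a_hat`, one can extrapolate the fitted curve to larger FLOP budgets, which is the kind of forecasting the paper uses to size its compute-optimal agents.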