Atari-5: Distilling the Arcade Learning Environment down to Five Games

Authors: Matthew Aitchison, Penny Sweetser, Marcus Hutter

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We applied our method to identify a subset of five ALE games, which we call Atari-5, that generally produces 57-game median score estimates to within 10% of their true values. Extending the subset to 10 games recovers 80% of the variance of log-scores for all games within the 57-game set. We show this level of compression is possible due to a high degree of correlation between many of the games in ALE.
Researcher Affiliation | Collaboration | Matthew Aitchison¹, Penny Sweetser¹, Marcus Hutter². ¹Australian National University, Canberra, Australia; ²DeepMind, United Kingdom.
Pseudocode | Yes | Algorithm 1 BEST SUBSET: Find the best subset of size C, according to a target summary score.
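The paper only names Algorithm 1 in this row; a minimal sketch of that kind of exhaustive best-subset search is shown below. It assumes a dict mapping game names to per-algorithm log-normalised score vectors and a target summary vector; the function names `cv_mse` and `best_subset` are illustrative, not taken from the paper's code.

```python
from itertools import combinations

import numpy as np


def cv_mse(X, y, folds=10):
    """Mean squared error of a no-intercept linear model under k-fold CV."""
    n = len(y)
    idx = np.arange(n)
    fold_ids = idx % folds  # simple deterministic fold assignment
    errs = []
    for f in range(folds):
        train, test = idx[fold_ids != f], idx[fold_ids == f]
        # lstsq fits y ≈ X @ w with no intercept term.
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errs))


def best_subset(scores, target, size):
    """Search all game subsets of a given size; return the subset whose
    no-intercept linear fit best predicts the target summary score."""
    games = list(scores.keys())
    best, best_err = None, float("inf")
    for subset in combinations(games, size):
        X = np.column_stack([scores[g] for g in subset])
        err = cv_mse(X, target)
        if err < best_err:
            best, best_err = subset, err
    return best, best_err
```

The search is exhaustive over all C(57, 5) subsets, which is why the paper reports it as feasible on a single multi-core machine.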
Open Source Code | Yes | Source code for this paper can be found at https://github.com/maitchison/Atari-5
Open Datasets | Yes | We used the website paperswithcode as the primary source of data for our experiments. This website contains scores for algorithms with published results on various benchmarks, including ALE. The dataset was then supplemented with additional results from papers not included on the website.
Dataset Splits | Yes | Subsets were evaluated by fitting linear regression models to the data using the log normalised scores of games within the subset to predict the median overall, selecting the model with the lowest 10-fold cross-validated mean-squared-error.
Hardware Specification | Yes | Searching over all subsets took approximately 1 hour on a 12-core machine. No GPU resources were used.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch, NumPy, etc.) used for their implementation.
Experiment Setup | Yes | Subsets were evaluated by fitting linear regression models to the data using the log normalised scores of games within the subset to predict the median overall, selecting the model with the lowest 10-fold cross-validated mean-squared-error. For these models, we disabled intercepts as we wanted a random policy to produce a score of 0.
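To illustrate why disabling the intercept makes a random policy predict exactly 0, here is a small sketch on synthetic stand-in numbers (not the paper's data); with scikit-learn the equivalent switch would be `LinearRegression(fit_intercept=False)`.

```python
import numpy as np

# Synthetic stand-in data: rows are algorithms, columns are the five subset
# games; entries are log-normalised scores, so a random policy scores 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ np.array([0.3, 0.2, 0.2, 0.2, 0.1])  # stand-in 57-game median log-score

# No-intercept least-squares fit: the model is y ≈ X @ w with no bias term.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# A random policy has log-normalised score 0 on every game, so its
# prediction is the dot product of the zero vector with w: exactly 0.
random_policy = np.zeros(5)
print(random_policy @ w)  # 0.0
```

With an intercept enabled, the same random policy would instead be predicted to score the fitted bias term, which is what the authors wanted to avoid.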