Atari-5: Distilling the Arcade Learning Environment down to Five Games

Authors: Matthew Aitchison, Penny Sweetser, Marcus Hutter

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We applied our method to identify a subset of five ALE games, which we call Atari-5, that generally produces 57-game median score estimates to within 10% of their true values. Extending the subset to 10 games recovers 80% of the variance of log-scores for all games within the 57-game set. We show this level of compression is possible due to a high degree of correlation between many of the games in ALE.
Researcher Affiliation | Collaboration | Matthew Aitchison¹, Penny Sweetser¹, Marcus Hutter². ¹Australian National University, Canberra, Australia; ²DeepMind, United Kingdom.
Pseudocode | Yes | Algorithm 1 BEST SUBSET: Find the best subset of size C, according to a target summary score.
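The paper only names Algorithm 1 in this row; a minimal sketch of that kind of exhaustive best-subset search is shown below. It assumes a dict mapping game names to per-algorithm log-normalised score vectors and a target summary vector; the function names `cv_mse` and `best_subset` are illustrative, not taken from the paper's code.

```python
from itertools import combinations

import numpy as np


def cv_mse(X, y, folds=10):
    """Mean squared error of a no-intercept linear model under k-fold CV."""
    n = len(y)
    idx = np.arange(n)
    fold_ids = idx % folds  # simple deterministic fold assignment
    errs = []
    for f in range(folds):
        train, test = idx[fold_ids != f], idx[fold_ids == f]
        # lstsq fits y ≈ X @ w with no intercept term.
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))
    return float(np.mean(errs))


def best_subset(scores, target, size):
    """Search all game subsets of a given size; return the subset whose
    no-intercept linear fit best predicts the target summary score."""
    games = list(scores.keys())
    best, best_err = None, float("inf")
    for subset in combinations(games, size):
        X = np.column_stack([scores[g] for g in subset])
        err = cv_mse(X, target)
        if err < best_err:
            best, best_err = subset, err
    return best, best_err
```

The search is exhaustive over all C(57, 5) subsets, which is why the paper reports it as feasible on a single multi-core machine.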
Open Source Code | Yes | Source code for this paper can be found at https://github.com/maitchison/Atari-5
Open Datasets | Yes | We used the website paperswithcode as the primary source of data for our experiments. This website contains scores for algorithms with published results on various benchmarks, including ALE. The dataset was then supplemented with additional results from papers not included on the website.
Dataset Splits | Yes | Subsets were evaluated by fitting linear regression models to the data using the log normalised scores of games within the subset to predict the median overall, selecting the model with the lowest 10-fold cross-validated mean-squared-error.
Hardware Specification | Yes | Searching over all subsets took approximately 1 hour on a 12-core machine. No GPU resources were used.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python version, library versions like PyTorch, NumPy, etc.) used for their implementation.
Experiment Setup | Yes | Subsets were evaluated by fitting linear regression models to the data using the log normalised scores of games within the subset to predict the median overall, selecting the model with the lowest 10-fold cross-validated mean-squared-error. For these models, we disabled intercepts as we wanted a random policy to produce a score of 0.
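To illustrate why disabling the intercept makes a random policy predict exactly 0, here is a small sketch on synthetic stand-in numbers (not the paper's data); with scikit-learn the equivalent switch would be `LinearRegression(fit_intercept=False)`.

```python
import numpy as np

# Synthetic stand-in data: rows are algorithms, columns are the five subset
# games; entries are log-normalised scores, so a random policy scores 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ np.array([0.3, 0.2, 0.2, 0.2, 0.1])  # stand-in 57-game median log-score

# No-intercept least-squares fit: the model is y ≈ X @ w with no bias term.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# A random policy has log-normalised score 0 on every game, so its
# prediction is the dot product of the zero vector with w: exactly 0.
random_policy = np.zeros(5)
print(random_policy @ w)  # 0.0
```

With an intercept enabled, the same random policy would instead be predicted to score the fitted bias term, which is what the authors wanted to avoid.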