Deep Reinforcement Learning at the Edge of the Statistical Precipice

Authors: Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, Marc Bellemare

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis.
Researcher Affiliation | Collaboration | Rishabh Agarwal (Google Research, Brain Team; MILA, Université de Montréal); Max Schwarzer (MILA, Université de Montréal); Pablo Samuel Castro (Google Research, Brain Team); Aaron Courville (MILA, Université de Montréal); Marc G. Bellemare (Google Research, Brain Team)
Pseudocode | No | The paper describes its methodology in text but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | "accompanied with an open-source library, rliable, to prevent unreliable results from stagnating the field" (https://github.com/google-research/rliable). See the usage sketch following the table.
Open Datasets | Yes | Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. [5] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Dataset Splits | No | The paper analyzes performance on existing RL benchmarks (ALE, Atari 100k, Procgen, DeepMind Control Suite), which have their own evaluation protocols, but it does not specify explicit training, validation, or test splits for its own meta-analysis.
Hardware Specification | Yes | All our experiments are run on Google Cloud Platform with Tesla V100 GPUs.
Software Dependencies | No | The paper mentions the use of the JAX [13] and Dopamine [14] frameworks, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We investigate statistical variations in the few-run regime by evaluating 100 independent runs for each algorithm, where the score for a run is the average return obtained over 100 evaluation episodes after training. Each run corresponds to training one algorithm on each of the 26 games in Atari 100k. Refer to Appendix A.2 for more details about the experimental setup. A minimal sketch of aggregating such scores with confidence intervals follows the table.
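
The abstract and the Experiment Setup row describe reporting aggregate performance with statistical uncertainty over a score matrix of 100 runs by 26 games. The snippet below is a minimal, self-contained sketch (not the authors' code) of the interquartile mean (IQM) with a stratified percentile-bootstrap confidence interval, two of the tools the paper advocates; the score matrix is synthetic and the function names are illustrative.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of all run-task scores."""
    flat = np.sort(scores.flatten())
    n = len(flat)
    return flat[n // 4 : n - n // 4].mean()

def stratified_bootstrap_ci(scores, aggregate, reps=2000, alpha=0.05, seed=0):
    """Percentile CI for `aggregate`, resampling runs with replacement per task."""
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = scores.shape
    stats = np.empty(reps)
    for i in range(reps):
        idx = rng.integers(num_runs, size=(num_runs, num_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        stats[i] = aggregate(resampled)
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Hypothetical data: 100 runs x 26 games of human-normalized scores.
scores = np.random.default_rng(1).lognormal(mean=-1.0, sigma=0.7, size=(100, 26))
point_estimate = iqm(scores)
lower, upper = stratified_bootstrap_ci(scores, iqm)
print(f"IQM = {point_estimate:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```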
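For the Open Source Code row, the paper's own library can compute the same aggregate metrics and interval estimates directly. The sketch below follows the usage pattern shown in the rliable repository's README; the exact function signatures and the synthetic score dictionary here are assumptions and should be checked against the installed version of the library.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Hypothetical score dictionary: algorithm name -> (num_runs x num_games) array
# of human-normalized scores, matching the 100-run x 26-game setup above.
score_dict = {
    "Algorithm A": np.random.default_rng(0).lognormal(-1.0, 0.7, size=(100, 26)),
    "Algorithm B": np.random.default_rng(1).lognormal(-0.8, 0.7, size=(100, 26)),
}

# Median, IQM, mean, and optimality gap with stratified bootstrap CIs.
aggregate_func = lambda x: np.array([
    metrics.aggregate_median(x),
    metrics.aggregate_iqm(x),
    metrics.aggregate_mean(x),
    metrics.aggregate_optimality_gap(x),
])
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=50000)
```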