Deep Reinforcement Learning at the Edge of the Statistical Precipice

Authors: Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, Marc Bellemare

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis.
Researcher Affiliation | Collaboration | Rishabh Agarwal (Google Research, Brain Team; MILA, Université de Montréal); Max Schwarzer (MILA, Université de Montréal); Pablo Samuel Castro (Google Research, Brain Team); Aaron Courville (MILA, Université de Montréal); Marc G. Bellemare (Google Research, Brain Team)
Pseudocode | No | The paper describes its methodology in text but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | "accompanied with an open-source library, rliable, to prevent unreliable results from stagnating the field" (https://github.com/google-research/rliable). See the usage sketch following the table.
Open Datasets | Yes | Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. [5] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
Dataset Splits | No | The paper analyzes performance on existing RL benchmarks (ALE, Atari 100k, Procgen, DeepMind Control Suite), which have their own evaluation protocols, but it does not specify explicit training, validation, or test splits for its own meta-analysis.
Hardware Specification | Yes | All our experiments are run on Google Cloud Platform with Tesla V100 GPUs.
Software Dependencies | No | The paper mentions the use of the JAX [13] and Dopamine [14] frameworks, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | We investigate statistical variations in the few-run regime by evaluating 100 independent runs for each algorithm, where the score for a run is the average return obtained over 100 evaluation episodes after training. Each run corresponds to training one algorithm on each of the 26 games in Atari 100k. Refer to Appendix A.2 for more details about the experimental setup. A minimal sketch of aggregating such scores with confidence intervals follows the table.
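
The abstract and the Experiment Setup row describe reporting aggregate performance with statistical uncertainty over a score matrix of 100 runs by 26 games. The snippet below is a minimal, self-contained sketch (not the authors' code) of the interquartile mean (IQM) with a stratified percentile-bootstrap confidence interval, two of the tools the paper advocates; the score matrix is synthetic and the function names are illustrative.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle 50% of all run-task scores."""
    flat = np.sort(scores.flatten())
    n = len(flat)
    return flat[n // 4 : n - n // 4].mean()

def stratified_bootstrap_ci(scores, aggregate, reps=2000, alpha=0.05, seed=0):
    """Percentile CI for `aggregate`, resampling runs with replacement per task."""
    rng = np.random.default_rng(seed)
    num_runs, num_tasks = scores.shape
    stats = np.empty(reps)
    for i in range(reps):
        idx = rng.integers(num_runs, size=(num_runs, num_tasks))
        resampled = np.take_along_axis(scores, idx, axis=0)
        stats[i] = aggregate(resampled)
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Hypothetical data: 100 runs x 26 games of human-normalized scores.
scores = np.random.default_rng(1).lognormal(mean=-1.0, sigma=0.7, size=(100, 26))
point_estimate = iqm(scores)
lower, upper = stratified_bootstrap_ci(scores, iqm)
print(f"IQM = {point_estimate:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```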
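For the Open Source Code row, the paper's own library can compute the same aggregate metrics and interval estimates directly. The sketch below follows the usage pattern shown in the rliable repository's README; the exact function signatures and the synthetic score dictionary here are assumptions and should be checked against the installed version of the library.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Hypothetical score dictionary: algorithm name -> (num_runs x num_games) array
# of human-normalized scores, matching the 100-run x 26-game setup above.
score_dict = {
    "Algorithm A": np.random.default_rng(0).lognormal(-1.0, 0.7, size=(100, 26)),
    "Algorithm B": np.random.default_rng(1).lognormal(-0.8, 0.7, size=(100, 26)),
}

# Median, IQM, mean, and optimality gap with stratified bootstrap CIs.
aggregate_func = lambda x: np.array([
    metrics.aggregate_median(x),
    metrics.aggregate_iqm(x),
    metrics.aggregate_mean(x),
    metrics.aggregate_optimality_gap(x),
])
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=50000)
```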