Deep Reinforcement Learning at the Edge of the Statistical Precipice
Authors: Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, Marc Bellemare
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without running the risk of slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone versus a more thorough statistical analysis. |
| Researcher Affiliation | Collaboration | Rishabh Agarwal (Google Research, Brain Team; MILA, Université de Montréal); Max Schwarzer (MILA, Université de Montréal); Pablo Samuel Castro (Google Research, Brain Team); Aaron Courville (MILA, Université de Montréal); Marc G. Bellemare (Google Research, Brain Team) |
| Pseudocode | No | The paper describes methodologies in text but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | accompanied with an open-source library rliable, to prevent unreliable results from stagnating the field. https://github.com/google-research/rliable (see the usage sketch after this table) |
| Open Datasets | Yes | Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. [5] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013. |
| Dataset Splits | No | The paper analyzes performance on existing RL benchmarks (ALE, Atari 100k, Procgen, DeepMind Control Suite) which have their own evaluation protocols, but it does not specify explicit training, validation, or test dataset splits for its own meta-analysis. |
| Hardware Specification | Yes | All our experiments are run on Google Cloud Platform with Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions the use of the JAX [13] and Dopamine [14] frameworks, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We investigate statistical variations in the few-run regime by evaluating 100 independent runs for each algorithm, where the score for a run is the average returns obtained in 100 evaluation episodes taking place after training. Each run corresponds to training one algorithm on each of the 26 games in Atari 100k. Refer to Appendix A.2 for more details about the experimental setup. |
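
As a companion to the experiment-setup row above, here is a minimal sketch of how per-run scores of the kind described there could be assembled, assuming access to raw evaluation returns and per-game random/human reference scores. All array values and reference scores below are placeholders, not data from the paper.

```python
import numpy as np

# Sketch of the few-run evaluation setup quoted above: the score of a run on a
# game is the average return over 100 post-training evaluation episodes, and
# each run covers the 26 Atari 100k games. All values here are placeholders.
NUM_RUNS, NUM_GAMES, NUM_EVAL_EPISODES = 100, 26, 100

# Raw evaluation returns with shape (runs, games, episodes); random data
# stands in for actual agent returns.
episode_returns = np.random.rand(NUM_RUNS, NUM_GAMES, NUM_EVAL_EPISODES)

# Per-run score on each game: mean return over the evaluation episodes.
per_run_scores = episode_returns.mean(axis=-1)  # shape (runs, games)

# Human-normalized scores, using placeholder per-game reference values
# (the paper normalizes each game's score by random and human performance).
RANDOM_SCORES = np.zeros(NUM_GAMES)
HUMAN_SCORES = np.full(NUM_GAMES, 1000.0)
normalized_scores = (per_run_scores - RANDOM_SCORES) / (HUMAN_SCORES - RANDOM_SCORES)
```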
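A normalized (runs × games) matrix like the one above is the kind of input the authors' open-source rliable library expects when reporting interval estimates instead of bare point estimates. The sketch below follows rliable's documented interface (library.get_interval_estimates, metrics.aggregate_iqm/median/mean); the algorithm names and score values are illustrative placeholders, not results from the paper.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# Normalized score matrices of shape (num_runs, num_tasks),
# e.g. 100 runs x 26 games for the Atari 100k setup described above.
score_dict = {
    'AlgorithmA': np.random.rand(100, 26),
    'AlgorithmB': np.random.rand(100, 26),
}

def aggregate_func(scores):
    # Report several aggregates together rather than a single point estimate.
    return np.array([
        metrics.aggregate_iqm(scores),     # interquartile mean across runs/tasks
        metrics.aggregate_median(scores),
        metrics.aggregate_mean(scores),
    ])

# Stratified bootstrap over runs; 2,000 resamples keeps this sketch fast
# (the library's examples use many more).
point_estimates, interval_estimates = rly.get_interval_estimates(
    score_dict, aggregate_func, reps=2000)
print(point_estimates)
print(interval_estimates)
```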