Is High Variance Unavoidable in RL? A Case Study in Continuous Control

Authors: Johan Bjorck, Carla P. Gomes, Kilian Q. Weinberger

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper, we investigate causes for this perceived instability. To allow for an in-depth analysis, we focus on a specifically popular setup with high variance: continuous control from pixels with an actor-critic agent. In this setting, we demonstrate that poor outlier runs which completely fail to learn are an important source of variance, but that weight initialization and initial exploration are not at fault. We show that one cause for these outliers is unstable network parametrization which leads to saturating nonlinearities. We investigate several fixes to this issue and find that simply normalizing penultimate features is surprisingly effective. For sparse tasks, we also find that partially disabling clipped double Q-learning decreases variance. By combining fixes we significantly decrease variances, lowering the average standard deviation across 21 tasks by a factor > 3 for a state-of-the-art agent."
Researcher Affiliation | Academia | Johan Bjorck, Carla P. Gomes, Kilian Q. Weinberger (Cornell University)
Pseudocode | No | The paper describes algorithms (DDPG and DrQ-v2) and their modifications in prose, but does not include any pseudocode or algorithm blocks.
Open Source Code | No | "Our experiments are based upon the open-source DrQ-v2 implementation of Yarats et al. (2021b)." This states that the experiments build on an existing open-source implementation, not that the authors' own code for their modifications is open-source or available.
Open Datasets | Yes | "We consider the standard continuous control benchmark DeepMind Control (dm-control) (Tassa et al., 2020)."
Dataset Splits | Yes | "For each run, we train the agent for one million frames, or equivalently 1,000 episodes, and evaluate over ten episodes."
Hardware Specification | Yes | "We run our experiments on Nvidia Tesla V100 GPUs and Intel Xeon CPUs."
Software Dependencies | Yes | "The GPUs use CUDA 11.1 and CUDNN 8.0.0.5. We use PyTorch 1.9.0 and Python 3.8.10."
Experiment Setup | Yes | "We use the default hyperparameters that Yarats et al. (2021b) use on the medium benchmark (listed in Appendix A) throughout the paper." "Table 5: Hyperparameters used throughout the paper. These follow Yarats et al. (2021b) for the medium tasks."
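The first fix quoted in the abstract, "normalizing penultimate features", can be sketched as a simple L2 normalization applied to the activations of the layer just before the final linear output. This is a minimal numpy illustration of the operation itself; the exact placement and norm used in the authors' agent are assumptions, since the paper's code is not included here.

```python
import numpy as np

def normalize_penultimate(features, eps=1e-8):
    """Scale each feature vector to unit L2 norm.

    Bounding the penultimate features this way keeps the inputs to the
    final layer in a fixed range, which is one way to prevent the kind
    of parameter blow-up that saturates downstream nonlinearities.
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    # eps guards against division by zero for all-zero feature vectors
    return features / np.maximum(norms, eps)

# A (3, 4) feature vector is rescaled to (0.6, 0.8), which has unit norm.
out = normalize_penultimate(np.array([[3.0, 4.0]]))
```

In a real actor-critic network this would be applied to the hidden activations feeding the final Q-value (or action) head, after the nonlinearity and before the last linear layer.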
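The second fix, "partially disabling clipped double Q-learning", can be illustrated with the TD target computation. In standard clipped double Q-learning (as in TD3), the target uses the minimum of two critics. One plausible way to partially disable the clip, shown below as a hypothetical interpolation rather than the paper's exact mechanism, is to blend the clipped (min) estimate with an unclipped (mean) estimate.

```python
import numpy as np

def td_target(reward, discount, q1_next, q2_next, done, clip_weight=0.5):
    """TD target that interpolates between clipped and unclipped double Q.

    clip_weight=1.0 recovers standard clipped double Q-learning (min of
    the two critics); clip_weight=0.0 disables the clip entirely (mean of
    the two critics). Intermediate values partially disable the clip.
    This interpolation is an illustrative assumption, not the paper's code.
    """
    clipped = np.minimum(q1_next, q2_next)      # pessimistic estimate
    unclipped = 0.5 * (q1_next + q2_next)       # unbiased-ish estimate
    q_next = clip_weight * clipped + (1.0 - clip_weight) * unclipped
    # done masks out the bootstrap term at terminal transitions
    return reward + discount * (1.0 - done) * q_next

# With q1=2, q2=4: full clipping bootstraps from 2, no clipping from 3.
full_clip = td_target(1.0, 0.99, 2.0, 4.0, 0.0, clip_weight=1.0)  # 2.98
no_clip = td_target(1.0, 0.99, 2.0, 4.0, 0.0, clip_weight=0.0)    # 3.97
```

The motivation in the abstract is that on sparse-reward tasks the pessimism of the min operator can suppress learning entirely in unlucky runs, so relaxing it reduces the frequency of outlier failures.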