Is High Variance Unavoidable in RL? A Case Study in Continuous Control
Authors: Johan Bjorck, Carla P. Gomes, Kilian Q. Weinberger
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we investigate causes for this perceived instability. To allow for an in-depth analysis, we focus on a specifically popular setup with high variance: continuous control from pixels with an actor-critic agent. In this setting, we demonstrate that poor outlier runs which completely fail to learn are an important source of variance, but that weight initialization and initial exploration are not at fault. We show that one cause for these outliers is unstable network parametrization which leads to saturating nonlinearities. We investigate several fixes to this issue and find that simply normalizing penultimate features is surprisingly effective. For sparse tasks, we also find that partially disabling clipped double Q-learning decreases variance. By combining fixes we significantly decrease variances, lowering the average standard deviation across 21 tasks by a factor > 3 for a state-of-the-art agent. (Illustrative sketches of the feature-normalization and relaxed clipped-double-Q fixes appear after this table.) |
| Researcher Affiliation | Academia | Johan Bjorck, Carla P. Gomes, Kilian Q. Weinberger Cornell University |
| Pseudocode | No | The paper describes algorithms (DDPG, DRQv2) and their modifications in prose, but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | Our experiments are based upon the open-source DRQv2 implementation of Yarats et al. (2021b). This indicates the authors built on an open-source implementation; it does not state that the code for their own modifications is open-source or otherwise released. |
| Open Datasets | Yes | We consider the standard continuous control benchmark deepmind control (dm-control) (Tassa et al., 2020). |
| Dataset Splits | Yes | For each run, we train the agent for one million frames, or equivalently 1,000 episodes, and evaluate over ten episodes. |
| Hardware Specification | Yes | We run our experiments on Nvidia Tesla V100 GPUs and Intel Xeon CPUs. |
| Software Dependencies | Yes | The GPUs use CUDA 11.1 and CUDNN 8.0.0.5. We use PyTorch 1.9.0 and Python 3.8.10. (A version-check sketch for this stack appears after this table.) |
| Experiment Setup | Yes | We use the default hyperparameters that Yarats et al. (2021b) uses on the medium benchmark (listed in Appendix A) throughout the paper. Table 5: Hyperparameters used throughout the paper. These follow Yarats et al. (2021b) for the medium tasks. |
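
The "normalizing penultimate features" fix quoted above can be pictured with a short sketch. This is a minimal illustration, not the authors' released code: the `NormalizedCritic` name and the layer widths are assumptions; the only part taken from the paper's description is that the features feeding the final linear layer are rescaled to unit L2 norm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormalizedCritic(nn.Module):
    """Minimal sketch of a critic head whose penultimate features are
    L2-normalized (hypothetical layer sizes, not the paper's architecture)."""

    def __init__(self, in_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_action: torch.Tensor) -> torch.Tensor:
        feats = self.trunk(obs_action)
        # Rescale the penultimate features to unit L2 norm so their
        # magnitude cannot grow unboundedly and saturate downstream
        # nonlinearities, the failure mode the abstract points to.
        feats = F.normalize(feats, dim=-1)
        return self.q_head(feats)
```

The single projection keeps feature magnitudes bounded regardless of how the network weights drift during training, which is the property the abstract credits for reducing outlier runs.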
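The abstract also states that "partially disabling clipped double Q-learning decreases variance" on sparse tasks, without spelling out the mechanism here. The sketch below shows the standard clipped target (the minimum over two target critics, as in TD3) and one illustrative way to relax it by blending the minimum with the mean; the blending scheme and the `lam` parameter are assumptions for illustration, not necessarily the paper's exact method.

```python
import torch


def td_target(reward, not_done, discount, q1_next, q2_next, lam=0.5):
    """Illustrative TD target with clipped double Q-learning partially relaxed.

    lam = 1.0 recovers the standard clipped target min(Q1, Q2);
    lam = 0.0 disables the clipping entirely by using the mean.
    (Hypothetical interpolation, not confirmed as the paper's scheme.)
    """
    clipped = torch.min(q1_next, q2_next)   # pessimistic TD3-style target
    unclipped = 0.5 * (q1_next + q2_next)   # plain average of the two critics
    blended = lam * clipped + (1.0 - lam) * unclipped
    return reward + not_done * discount * blended
```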
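For the software stack listed in the table, a reproduction attempt might start by confirming the local versions against the ones the paper reports. The snippet below is a convenience check written for this summary, not something shipped with the paper.

```python
import platform

import torch

# Versions reported in the paper.
EXPECTED = {"python": "3.8.10", "torch": "1.9.0", "cuda": "11.1"}

print("python :", platform.python_version(), "(expected", EXPECTED["python"] + ")")
print("torch  :", torch.__version__, "(expected", EXPECTED["torch"] + ")")
print("cuda   :", torch.version.cuda, "(expected", EXPECTED["cuda"] + ")")
print("cudnn  :", torch.backends.cudnn.version())  # paper reports CUDNN 8.0.0.5
```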