The Mirage of Action-Dependent Baselines in Reinforcement Learning
Authors: George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, Sergey Levine
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. |
| Researcher Affiliation | Collaboration | 1: Google Brain, USA; 2: Work was done during the Google AI Residency; 3: University of Cambridge, UK; 4: Max Planck Institute for Intelligent Systems, Germany; 5: Uber AI Labs, USA; 6: UC Berkeley, USA |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have made our code and additional visualizations available at https://sites.google.com/view/mirage-rl. |
| Open Datasets | Yes | We numerically evaluate the variance components on a synthetic linear-quadratic Gaussian (LQG) task... and on benchmark continuous control tasks... Humanoid-v1... HalfCheetah-v1... CartPole-v0. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016. |
| Dataset Splits | No | The paper reports mean episode reward (with standard deviation) for policies trained in continuous control environments, but it does not specify explicit train/validation/test splits with percentages or counts; such splits are typical of supervised learning rather than reinforcement learning benchmarks. |
| Hardware Specification | No | The paper describes running experiments but does not provide specific details on the hardware used, such as GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper refers to using TRPO and GAE, which are algorithms/methods, but it does not specify version numbers for any software libraries, programming languages, or environments used (e.g., Python version, TensorFlow/PyTorch version). |
| Experiment Setup | Yes | The batch size across all experiments was 5000. |
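The paper's central object is the score-function (policy gradient) estimator with a subtracted baseline. As a hedged illustration of why a baseline can reduce variance without biasing the gradient (a toy 1-D Gaussian-policy bandit, not code from the paper; all names here are hypothetical):

```python
import numpy as np

# Toy sketch: variance of a score-function gradient estimate for a
# Gaussian policy a ~ N(mu, sigma^2) on a bandit, with and without a
# constant baseline. The baseline choice below is a crude heuristic.

rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.0
reward = lambda a: -(a - 2.0) ** 2  # deterministic reward function

def grad_samples(n, baseline=0.0):
    """Single-sample estimates of d/d_mu E[r(a)] via the score function."""
    a = rng.normal(mu, sigma, size=n)
    score = (a - mu) / sigma**2  # d/d_mu log N(a; mu, sigma^2)
    return (reward(a) - baseline) * score

no_baseline = grad_samples(100_000)
with_baseline = grad_samples(100_000, baseline=reward(mu))

# Both estimators have the same mean (the baseline term has zero
# expectation), but the baselined one typically has lower variance.
print(np.var(no_baseline), np.var(with_baseline))
```

The paper's point is subtler than this sketch: it decomposes the variance further and shows that, on the benchmarks studied, a *state-action*-dependent baseline offers no additional variance reduction over a state-dependent one.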