The Mirage of Action-Dependent Baselines in Reinforcement Learning

Authors: George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, Sergey Levine

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. (An illustrative sketch of a baselined policy-gradient estimator follows the table.)
Researcher Affiliation | Collaboration | 1 Google Brain, USA; 2 Work was done during the Google AI Residency; 3 University of Cambridge, UK; 4 Max Planck Institute for Intelligent Systems, Germany; 5 Uber AI Labs, USA; 6 UC Berkeley, USA
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We have made our code and additional visualizations available at https://sites.google.com/view/mirage-rl.
Open Datasets | Yes | We numerically evaluate the variance components on a synthetic linear-quadratic Gaussian (LQG) task... and on benchmark continuous control tasks... Humanoid-v1... HalfCheetah-v1... CartPole-v0. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Dataset Splits | No | The paper mentions training policies in continuous control environments and reporting mean episode reward with standard deviation. However, it does not specify explicit train/validation/test dataset splits with percentages or counts, which are typical of supervised learning tasks rather than reinforcement learning environments.
Hardware Specification | No | The paper describes running experiments but does not provide specific details on the hardware used, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper refers to using TRPO and GAE, which are algorithms/methods, but it does not specify version numbers for any software libraries, programming languages, or environments used (e.g., Python version, TensorFlow/PyTorch version).
Experiment Setup | Yes | The batch size across all experiments was 5000.
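
For context on the baseline terminology quoted in the Research Type row, the following is a minimal illustrative sketch, not the authors' code, of a score-function policy-gradient estimator on a toy one-dimensional problem. It compares the empirical variance of single-sample gradient estimates with no baseline versus a state-dependent baseline b(s) = E_a[r(s, a)]; the paper's question is whether making the baseline additionally action-dependent reduces variance further in practice. The toy reward, function names, and constants below are assumptions made purely for illustration.

```python
# Illustrative sketch (assumptions only): score-function policy gradient
# with and without a state-dependent baseline on a toy 1-D problem.
import numpy as np

rng = np.random.default_rng(0)

sigma = 0.5          # fixed policy standard deviation
theta = 0.3          # policy parameter: mean action is theta * s
n_samples = 100_000  # Monte Carlo samples used to estimate variance


def reward(s, a):
    # Toy quadratic reward; stands in for the per-step return.
    return -(a - 1.0) ** 2


def grad_log_pi(s, a):
    # d/dtheta log N(a; theta * s, sigma^2) = (a - theta * s) * s / sigma^2
    return (a - theta * s) * s / sigma ** 2


def state_baseline(s, n=10_000):
    # State-dependent baseline: Monte Carlo estimate of E_a[r(s, a)].
    a = rng.normal(theta * s, sigma, size=n)
    return reward(s, a).mean()


s = 1.0  # single fixed state for clarity
a = rng.normal(theta * s, sigma, size=n_samples)
g_no_baseline = grad_log_pi(s, a) * reward(s, a)
g_state_baseline = grad_log_pi(s, a) * (reward(s, a) - state_baseline(s))

print("mean (no baseline):   ", g_no_baseline.mean())
print("mean (state baseline):", g_state_baseline.mean())
print("var  (no baseline):   ", g_no_baseline.var())
print("var  (state baseline):", g_state_baseline.var())
```

Under these assumptions both estimators are unbiased and only their variance differs, which is the quantity the paper decomposes when comparing state-dependent and state-action-dependent baselines.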