The Mirage of Action-Dependent Baselines in Reinforcement Learning

Authors: George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard Turner, Zoubin Ghahramani, Sergey Levine

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. (An illustrative sketch of a baselined policy-gradient estimator follows the table.)
Researcher Affiliation | Collaboration | 1 Google Brain, USA; 2 Work was done during the Google AI Residency; 3 University of Cambridge, UK; 4 Max Planck Institute for Intelligent Systems, Germany; 5 Uber AI Labs, USA; 6 UC Berkeley, USA
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We have made our code and additional visualizations available at https://sites.google.com/view/mirage-rl.
Open Datasets | Yes | We numerically evaluate the variance components on a synthetic linear-quadratic Gaussian (LQG) task... and on benchmark continuous control tasks... Humanoid-v1... HalfCheetah-v1... CartPole-v0. Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
Dataset Splits | No | The paper mentions training policies in continuous control environments and reporting mean episode reward with standard deviation. However, it does not specify explicit train/validation/test dataset splits with percentages or counts, which are typical of supervised learning tasks rather than reinforcement learning environments.
Hardware Specification | No | The paper describes running experiments but does not provide specific details on the hardware used, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper refers to using TRPO and GAE, which are algorithms/methods, but it does not specify version numbers for any software libraries, programming languages, or environments used (e.g., Python version, TensorFlow/PyTorch version).
Experiment Setup | Yes | The batch size across all experiments was 5000.
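
For context on the baseline terminology quoted in the Research Type row, the following is a minimal illustrative sketch, not the authors' code, of a score-function policy-gradient estimator on a toy one-dimensional problem. It compares the empirical variance of single-sample gradient estimates with no baseline versus a state-dependent baseline b(s) = E_a[r(s, a)]; the paper's question is whether making the baseline additionally action-dependent reduces variance further in practice. The toy reward, function names, and constants below are assumptions made purely for illustration.

```python
# Illustrative sketch (assumptions only): score-function policy gradient
# with and without a state-dependent baseline on a toy 1-D problem.
import numpy as np

rng = np.random.default_rng(0)

sigma = 0.5          # fixed policy standard deviation
theta = 0.3          # policy parameter: mean action is theta * s
n_samples = 100_000  # Monte Carlo samples used to estimate variance


def reward(s, a):
    # Toy quadratic reward; stands in for the per-step return.
    return -(a - 1.0) ** 2


def grad_log_pi(s, a):
    # d/dtheta log N(a; theta * s, sigma^2) = (a - theta * s) * s / sigma^2
    return (a - theta * s) * s / sigma ** 2


def state_baseline(s, n=10_000):
    # State-dependent baseline: Monte Carlo estimate of E_a[r(s, a)].
    a = rng.normal(theta * s, sigma, size=n)
    return reward(s, a).mean()


s = 1.0  # single fixed state for clarity
a = rng.normal(theta * s, sigma, size=n_samples)
g_no_baseline = grad_log_pi(s, a) * reward(s, a)
g_state_baseline = grad_log_pi(s, a) * (reward(s, a) - state_baseline(s))

print("mean (no baseline):   ", g_no_baseline.mean())
print("mean (state baseline):", g_state_baseline.mean())
print("var  (no baseline):   ", g_no_baseline.var())
print("var  (state baseline):", g_state_baseline.var())
```

Under these assumptions both estimators are unbiased and only their variance differs, which is the quantity the paper decomposes when comparing state-dependent and state-action-dependent baselines.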