Action-dependent Control Variates for Policy Optimization via Stein Identity

Authors: Hao Liu*, Yihao Feng*, Yi Mao, Dengyong Zhou, Jian Peng, Qiang Liu

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical studies show that our method significantly improves the sample efficiency of the state-of-the-art policy gradient approaches."
Researcher Affiliation | Collaboration | Hao Liu, Computer Science, UESTC, Chengdu, China (uestcliuhao@gmail.com); Yihao Feng, Computer Science, University of Texas at Austin, Austin, TX 78712 (yihao@cs.utexas.edu); Yi Mao, Microsoft, Redmond, WA 98052 (maoyi@microsoft.com); Dengyong Zhou, Google, Kirkland, WA 98033 (dennyzhou@google.com); Jian Peng, Computer Science, UIUC, Urbana, IL 61801 (jianpeng@illinois.edu); Qiang Liu, Computer Science, University of Texas at Austin, Austin, TX 78712 (lqiang@cs.utexas.edu)
Pseudocode | Yes | "Algorithm 1: PPO with Control Variate through Stein's Identity (the PPO procedure is adapted from Algorithm 1 in Heess et al., 2017)." A sketch of the underlying gradient estimator is given after the table.
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | "continuous control environments from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo physics simulator (Todorov et al., 2012)." An illustrative environment-loading snippet follows the table.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and testing.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions 'Adam (Kingma & Ba, 2014)' but does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | "The advantage estimation Â^π(s_t, a_t) in Eq. 16 is done by GAE with λ = 0.98 and γ = 0.995 (Schulman et al., 2016), and correspondingly Q̂^π(s_t, a_t) = Â^π(s_t, a_t) + V̂^π(s_t) in Eq. 9. Observations and advantages are normalized as suggested by Heess et al. (2017). The neural networks of the policies π(a|s) and the baseline functions φ_w(s, a) use ReLU activation units, and the neural network of the value function V̂^π(s) uses Tanh activation units. All results use a Gaussian MLP policy with a neural-network mean and a constant diagonal covariance matrix. Denote by d_s and d_a the dimensions of the state s and the action a, respectively. Network sizes are as follows: on Humanoid-v1 and HumanoidStandup-v1, we use (d_s, d_a 5, 5) for both the policy network and the value network; on the other MuJoCo environments, we use (10 d_s, 10 d_s 5, 5) for both, with learning rate 0.0009 (d_s 5) for the policy network and 0.0001 (d_s 5) for the value network. All experiments of PPO with the Stein control variate select the best learning rate from {0.001, 0.0005, 0.0001} for the φ networks. We use Adam (Kingma & Ba, 2014) for gradient descent and evaluate the policy every 20 iterations. The Stein control variate is trained for the best number of iterations from {250, 300, 400, 500, 800}." A GAE sketch with the reported λ and γ follows the table.
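
The Pseudocode row cites Algorithm 1 (PPO with a control variate through Stein's identity). As a rough illustration of the gradient estimator that algorithm plugs into PPO, the following PyTorch sketch builds a surrogate loss whose gradient matches E[∇_θ log π_θ(a|s) (Q̂(s,a) − φ_w(s,a)) + ∇_θ μ_θ(s)ᵀ ∇_a φ_w(s,a)] for a Gaussian policy with a constant diagonal covariance, matching the policy class reported above. This is a hedged reconstruction, not the authors' released code; `policy_mean`, `phi_net`, and `q_hat` are placeholder names.

```python
# Sketch only: Stein control-variate policy-gradient surrogate for a Gaussian policy.
# For simplicity, only the mean-network part of the Stein correction is shown; the
# paper's full estimator also handles the covariance parameters.
import torch

def stein_cv_policy_gradient_loss(policy_mean, log_std, phi_net, states, actions, q_hat):
    mu = policy_mean(states)                                  # mean of the Gaussian policy
    dist = torch.distributions.Normal(mu, log_std.exp())
    log_prob = dist.log_prob(actions).sum(-1)                 # log pi_theta(a|s)

    # grad_a phi_w(s, a): differentiate the action-dependent baseline w.r.t. actions.
    actions_req = actions.detach().requires_grad_(True)
    phi = phi_net(states, actions_req).squeeze(-1)
    grad_a_phi = torch.autograd.grad(phi.sum(), actions_req)[0]

    # Score-function term with phi as an action-dependent baseline ...
    score_term = log_prob * (q_hat - phi.detach())
    # ... plus the Stein correction term grad_theta mu_theta(s)^T grad_a phi(s, a).
    correction = (mu * grad_a_phi.detach()).sum(-1)

    # Negative sign: minimizing this surrogate ascends the estimated policy gradient.
    return -(score_term + correction).mean()
```

Detaching `phi` and `grad_a_phi` keeps the policy update from back-propagating into φ_w; the baseline network is fit in a separate step, as in the paper's algorithm.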
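
The Open Datasets row points to the OpenAI Gym MuJoCo benchmark. Purely for illustration, a task of that era can be instantiated as below; the environment ID "Hopper-v1" follows the Gym v1 naming used in the paper, the snippet assumes the pre-0.26 Gym API, and a local MuJoCo installation is required.

```python
import gym

env = gym.make("Hopper-v1")               # one of the MuJoCo continuous-control tasks
obs = env.reset()
d_s = env.observation_space.shape[0]      # state dimension d_s
d_a = env.action_space.shape[0]           # action dimension d_a
obs, reward, done, info = env.step(env.action_space.sample())
```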
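
The Experiment Setup row reports GAE with λ = 0.98 and γ = 0.995 and the identity Q̂^π(s_t, a_t) = Â^π(s_t, a_t) + V̂^π(s_t). A minimal NumPy sketch of that advantage computation, assuming a single complete trajectory with a zero bootstrap value after the final step, is:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.98):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.append(np.asarray(values, dtype=np.float64), 0.0)  # zero bootstrap at termination
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):                        # backward recursion over the trajectory
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    q_hat = advantages + values[:-1]                               # Q-hat = A-hat + V-hat, as in Eq. 9
    return advantages, q_hat
```

The returned `q_hat` is the target Q̂^π(s_t, a_t) fed to the baseline and policy updates; in practice the advantages are additionally normalized, as the setup above notes.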