Action-dependent Control Variates for Policy Optimization via Stein Identity
Authors: Hao Liu*, Yihao Feng*, Yi Mao, Dengyong Zhou, Jian Peng, Qiang Liu
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical studies show that our method significantly improves the sample efficiency of the state-of-the-art policy gradient approaches. |
| Researcher Affiliation | Collaboration | Hao Liu, Computer Science, UESTC, Chengdu, China (uestcliuhao@gmail.com); Yihao Feng, Computer Science, University of Texas at Austin, Austin, TX 78712 (yihao@cs.utexas.edu); Yi Mao, Microsoft, Redmond, WA 98052 (maoyi@microsoft.com); Dengyong Zhou, Google, Kirkland, WA 98033 (dennyzhou@google.com); Jian Peng, Computer Science, UIUC, Urbana, IL 61801 (jianpeng@illinois.edu); Qiang Liu, Computer Science, University of Texas at Austin, Austin, TX 78712 (lqiang@cs.utexas.edu) |
| Pseudocode | Yes | Algorithm 1: PPO with Control Variate through Stein's Identity (the PPO procedure is adapted from Algorithm 1 in Heess et al., 2017). An illustrative sketch of the corresponding gradient estimator appears below the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository. |
| Open Datasets | Yes | continuous control environments from the OpenAI Gym benchmark (Brockman et al., 2016) using the MuJoCo physics simulator (Todorov et al., 2012). |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Adam (Kingma & Ba, 2014)' but does not provide specific version numbers for any software dependencies (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | The advantage estimation Âπ(st, at) in Eq. 16 is done by GAE with λ = 0.98 and γ = 0.995 (Schulman et al., 2016), and correspondingly, Q̂π(st, at) = Âπ(st, at) + V̂π(st) in Eq. 9. Observations and advantages are normalized as suggested by Heess et al. (2017). The neural networks of the policies π(a|s) and baseline functions φw(s, a) use ReLU activation units, and the neural network of the value function V̂π(s) uses Tanh activation units. All results use a Gaussian MLP policy with a neural-network mean and a constant diagonal covariance matrix. Denote by ds and da the dimensions of the state s and action a, respectively. Network sizes are as follows: on Humanoid-v1 and HumanoidStandup-v1, we use (ds, da 5, 5) for both the policy network and the value network; on other MuJoCo environments, we use (10 ds, 10 ds 5, 5) for both, with learning rate 0.0009 (ds 5) for the policy network and 0.0001 (ds 5) for the value network. All experiments of PPO with the Stein control variate select the best learning rate from {0.001, 0.0005, 0.0001} for the φ networks. Adam (Kingma & Ba, 2014) is used for gradient descent, and the policy is evaluated every 20 iterations. The Stein control variate is trained for the best number of iterations in {250, 300, 400, 500, 800}. Illustrative sketches of this setup and of the GAE computation follow the table. |
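
For readers reconstructing the setup, the following is a minimal PyTorch sketch of the architectures described in the Experiment Setup row: a Gaussian MLP policy with a ReLU mean network and a constant (state-independent) diagonal covariance, plus a Tanh value network trained with Adam. PyTorch, the class and function names, the placeholder hidden sizes, and the example dimensions are assumptions for illustration; the paper's exact, dimension-dependent layer sizes and learning rates are not reproduced here.

```python
import torch
import torch.nn as nn

class GaussianMLPPolicy(nn.Module):
    """Gaussian policy: neural-network mean with ReLU units and a constant
    diagonal covariance (state-independent log-std parameters)."""
    def __init__(self, ds, da, hidden=(64, 64)):  # hidden sizes are placeholders
        super().__init__()
        layers, last = [], ds
        for h in hidden:
            layers += [nn.Linear(last, h), nn.ReLU()]
            last = h
        layers.append(nn.Linear(last, da))
        self.mean_net = nn.Sequential(*layers)
        self.log_std = nn.Parameter(torch.zeros(da))  # constant diagonal covariance

    def forward(self, states):
        return self.mean_net(states)  # Gaussian mean mu_theta(s)

def make_value_net(ds, hidden=(64, 64)):
    """Value network V(s) with Tanh activation units."""
    layers, last = [], ds
    for h in hidden:
        layers += [nn.Linear(last, h), nn.Tanh()]
        last = h
    layers.append(nn.Linear(last, 1))
    return nn.Sequential(*layers)

# Both networks are optimized with Adam (Kingma & Ba, 2014); the learning rates
# and state/action dimensions below are placeholders, not the paper's values.
policy, value_net = GaussianMLPPolicy(ds=17, da=6), make_value_net(ds=17)
policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-4)
```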
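
The GAE settings quoted above (λ = 0.98, γ = 0.995, Q̂π = Âπ + V̂π, and advantage normalization) can be made concrete with the short sketch below; the function name and the per-trajectory array interface are illustrative assumptions rather than code from the paper.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.995, lam=0.98):
    """GAE (Schulman et al., 2016) for a single trajectory.

    rewards: shape (T,); values: shape (T + 1,), where values[T] is the
    bootstrap value of the final state (0 if the episode terminated).
    Returns (normalized advantages, Q_hat) with Q_hat = A_hat + V_hat.
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        adv[t] = gae
    q_hat = adv + values[:-1]  # Q_hat(s_t, a_t) = A_hat(s_t, a_t) + V_hat(s_t)
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)  # advantage normalization
    return adv, q_hat
```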
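
Finally, since the Pseudocode row points to Algorithm 1 (PPO with a control variate built through Stein's identity), here is a simplified sketch of the gradient estimator it relies on for a Gaussian policy: the score-function term is weighted by Q̂ − φ, and a Stein correction term adds the gradient of φ through the policy mean. The `policy`/`phi` interfaces follow the sketch above, and dropping the covariance-related correction term is a simplifying assumption, so this is an illustration rather than the paper's exact algorithm.

```python
import torch

def stein_cv_policy_loss(policy, phi, states, actions, q_hat):
    """Surrogate loss whose gradient matches a Stein-control-variate policy
    gradient for a Gaussian policy a = mu_theta(s) + sigma * eps.

    Assumed interfaces: policy(states) -> mean of shape (N, da), with a
    policy.log_std parameter; phi(states, actions) -> baseline values (N,).
    """
    mean = policy(states)
    dist = torch.distributions.Normal(mean, policy.log_std.exp())
    logp = dist.log_prob(actions).sum(-1)

    # Score-function term: grad log pi(a|s) * (Q_hat - phi), with phi held constant.
    with torch.no_grad():
        phi_val = phi(states, actions)
    score_term = -(logp * (q_hat - phi_val)).mean()

    # Stein correction: E[ grad_theta mu_theta(s)^T grad_a phi(s, a) ],
    # implemented as a surrogate whose theta-gradient equals that product.
    a = actions.detach().requires_grad_(True)
    grad_a_phi = torch.autograd.grad(phi(states, a).sum(), a)[0]
    correction = -(mean * grad_a_phi.detach()).sum(-1).mean()

    return score_term + correction  # minimize this to ascend the policy gradient
```

In a PPO loop this term would stand in for the vanilla policy-gradient objective, while φw is fitted separately (the paper considers, for example, minimizing the variance of the resulting gradient estimator).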