Policy Optimization with Stochastic Mirror Descent
Authors: Long Yang, Yu Zhang, Gang Zheng, Qian Zheng, Pengfei Li, Jianhang Huang, Gang Pan (pp. 8823-8831)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper proposes the VRMPO algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In VRMPO, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed VRMPO needs only O(ϵ⁻³) sample trajectories to achieve an ϵ-approximate first-order stationary point, which matches the best sample complexity for policy optimization. Extensive empirical results demonstrate that VRMPO outperforms the state-of-the-art policy gradient methods in various settings. |
| Researcher Affiliation | Collaboration | Long Yang¹*, Yu Zhang²*, Gang Zheng¹, Qian Zheng¹,³, Pengfei Li¹, Jianhang Huang¹, Gang Pan¹; ¹College of Computer Science and Technology, Zhejiang University, China; ²Netease Games AI Lab, Hangzhou, China; ³School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore |
| Pseudocode | Yes | Algorithm 1: MPO; Algorithm 2: VRMPO (a hedged sketch of the combined update appears after this table). |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We provide a numerical analysis of MPO, and compare the convergence rate of MPO with REINFORCE and VPG on the Short Corridor with Switched Actions (SASC) domain (Sutton and Barto 2018). ...To demonstrate the stability and efficiency of VRMPO on the MuJoCo continuous control tasks, we provide a comprehensive comparison to state-of-the-art policy optimization algorithms. |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, specific percentages, or a detailed splitting methodology for the experiments conducted on SASC or MuJoCo environments. |
| Hardware Specification | No | The paper does not specify the hardware used for running its experiments, such as specific GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., programming languages, libraries, or frameworks) used in the experiments. |
| Experiment Setup | Yes | The discount factor γ = 0.99 and the step-size α is chosen by a grid search over the set {0.01, 0.02, 0.04, 0.08, 0.1}. We use a two-layer feedforward neural network of 200 and 100 hidden nodes, respectively, with rectified linear unit (ReLU) activations between the layers. For each step t, we construct a critic network Qω(s, a) with parameter ω, sample {(s_i, a_i)}_{i=1}^{N} from a data memory D, and learn the parameter ω by minimizing the critic loss as follows, [...] we run 5000 iterations for each epoch. (A hedged critic sketch follows the table.) |
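
The pseudocode row above names the two algorithms but does not reproduce the paper's listings. Below is a minimal sketch of the two ingredients they combine: an SVRG-style variance-reduced policy gradient estimator and a mirror-descent parameter update. It is written against a toy two-armed bandit with a softmax policy; the bandit rewards, all function and variable names, and the squared-Euclidean Bregman divergence (which reduces the mirror step to a plain gradient step) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARDS = np.array([1.0, 0.0])  # assumed toy bandit: arm 0 pays 1, arm 1 pays 0

def pi(theta):
    """Softmax policy over two arms."""
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """Gradient of log pi_theta(a) for the softmax parameterization."""
    return np.eye(2)[a] - pi(theta)

def pg(theta, a):
    """Single-sample REINFORCE gradient g(a; theta) = r(a) * grad log pi_theta(a)."""
    return REWARDS[a] * grad_log_pi(theta, a)

def reference_gradient(theta_ref, n=1000):
    """Large-batch gradient mu at the reference point (outer loop of VRMPO-style methods)."""
    actions = rng.choice(2, size=n, p=pi(theta_ref))
    return np.mean([pg(theta_ref, a) for a in actions], axis=0)

def vr_gradient(theta, theta_ref, mu_ref, batch_size=10):
    """Variance-reduced estimator:
       v = mean_j[ g(a_j; theta) - w_j * g(a_j; theta_ref) ] + mu_ref,
    with importance weight w_j = pi_ref(a_j) / pi_theta(a_j) correcting for the
    fact that a_j is sampled from the current policy, not the reference one."""
    p, p_ref = pi(theta), pi(theta_ref)
    actions = rng.choice(2, size=batch_size, p=p)
    corr = [pg(theta, a) - (p_ref[a] / p[a]) * pg(theta_ref, a) for a in actions]
    return np.mean(corr, axis=0) + mu_ref

def mirror_ascent_step(theta, v, alpha=0.1):
    """Mirror-descent (ascent) step; with the squared-Euclidean Bregman
    divergence this is simply theta + alpha * v."""
    return theta + alpha * v

theta_ref = np.zeros(2)
mu = reference_gradient(theta_ref)
theta = theta_ref.copy()
for _ in range(50):  # inner loop between reference refreshes
    theta = mirror_ascent_step(theta, vr_gradient(theta, theta_ref, mu))
print("P(arm 0) after training:", pi(theta)[0])
```

Other Bregman divergences (e.g., negative entropy over a simplex) would yield a different closed-form mirror step; the squared-Euclidean choice is used here only to keep the sketch short.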
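
The experiment-setup row quotes the network architecture (two hidden layers of 200 and 100 ReLU units) and a critic Qω(s, a) trained on pairs sampled from a memory D, but the critic loss itself is elided ("[...]"). The following PyTorch sketch is a minimal reading of that description; the squared TD-error target, the tensor shapes, the optimizer, and the class/function names are placeholder assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q_omega(s, a): two hidden layers of 200 and 100 units with ReLU, as quoted above."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def critic_loss(critic, batch, gamma=0.99):
    """Placeholder squared TD-error loss on (s_i, a_i, r_i, s'_i, a'_i) sampled
    from the memory D; the paper's exact loss is not reproduced in the quote."""
    s, a, r, s_next, a_next = batch
    with torch.no_grad():
        target = r + gamma * critic(s_next, a_next)  # assumed bootstrap target
    return ((critic(s, a) - target) ** 2).mean()

# Illustrative update using the gamma = 0.99 quoted in the setup row.
critic = Critic(state_dim=11, action_dim=3)           # dims are arbitrary examples
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)  # optimizer and lr are assumptions
batch = (torch.randn(64, 11), torch.randn(64, 3), torch.randn(64),
         torch.randn(64, 11), torch.randn(64, 3))
loss = critic_loss(critic, batch)
opt.zero_grad(); loss.backward(); opt.step()
```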