Mirror Descent Policy Optimization
Authors: Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically evaluate our on-policy and off-policy MDPO algorithms on a number of continuous control tasks from OpenAI Gym [7], and compare them with state-of-the-art baselines: TRPO, PPO, and SAC. We report all experimental details, including the hyper-parameter values used by the algorithms, in Appendix B. In the tabular results, both in the main paper and in Appendices E and F, we report the final training scores averaged over 5 runs and their 95% confidence intervals (CI). |
| Researcher Affiliation | Collaboration | Manan Tomar (University of Alberta, Amii) manan.tomar@gmail.com; Lior Shani (Technion, Israel) shanlior@gmail.com; Yonathan Efroni (Microsoft Research NYC) yefroni@microsoft.com; Mohammad Ghavamzadeh (Google Research) ghavamza@google.com |
| Pseudocode | Yes | Below we provide the pseudocodes for the two MDPO algorithms, on-policy and off-policy. Algorithm 1 On-Policy MDPO; Algorithm 2 Off-Policy MDPO; Algorithm 3 Off-Policy MDPO (Soft). (A hedged sketch of the on-policy surrogate appears below the table.) |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating the release of source code for the described methodology. |
| Open Datasets | Yes | We evaluate all algorithms on OpenAI Gym [7] based continuous control tasks, including Hopper-v2, Walker2d-v2, Half Cheetah-v2, Ant-v2, Humanoid-v2 and Humanoid Standup-v2... We also compare on-policy MDPO and PPO on 21 Atari games from the ALE benchmark [5]. (An environment-loading snippet follows the table.) |
| Dataset Splits | No | The paper does not provide explicit dataset splits for training, validation, and testing in the traditional supervised learning sense. In reinforcement learning, data is often generated dynamically through environment interaction rather than being statically split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'TensorFlow' or 'PyTorch' versions) for reproducibility. |
| Experiment Setup | Yes | We report all experimental details, including the hyper-parameter values used by the algorithms, in Appendix B... Table 2: Hyper-parameters of all on-policy methods. Table 3: Hyper-parameters of all off-policy methods. Table 4: Bregman stepsize for each domain, used by off-policy MDPO. |
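
The pseudocode entry above names Algorithms 1–3 without reproducing them. For context, the on-policy MDPO update described in the paper takes several SGD steps per iteration on an importance-weighted advantage term regularized by the KL divergence to the current anchor policy, scaled by 1/t_k. The PyTorch sketch below illustrates that surrogate for a diagonal-Gaussian policy; it is a minimal reading of the update, not the authors' implementation, and all names (`mdpo_on_policy_loss`, `gaussian_kl`, `t_k`) are illustrative.

```python
import torch

def gaussian_kl(mu_new, logstd_new, mu_old, logstd_old):
    """Closed-form KL(pi_theta(.|s) || pi_theta_k(.|s)) for diagonal Gaussian policies."""
    var_new = torch.exp(2.0 * logstd_new)
    var_old = torch.exp(2.0 * logstd_old)
    kl = logstd_old - logstd_new + (var_new + (mu_new - mu_old) ** 2) / (2.0 * var_old) - 0.5
    return kl.sum(dim=-1)  # sum over action dimensions

def mdpo_on_policy_loss(logp_new, logp_old, advantages, kl_to_anchor, t_k):
    """KL-regularized surrogate maximized for several SGD steps at MDPO iteration k (sketch).

    logp_new:     log pi_theta(a|s) under the policy being optimized
    logp_old:     log pi_theta_k(a|s) under the fixed anchor policy (treated as constant)
    advantages:   advantage estimates A^{pi_theta_k}(s, a)
    kl_to_anchor: per-state KL(pi_theta(.|s) || pi_theta_k(.|s)), e.g. from gaussian_kl
    t_k:          Bregman step size for this iteration
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    objective = ratio * advantages - kl_to_anchor / t_k
    return -objective.mean()  # negated so a gradient-descent optimizer maximizes the objective
```

Unlike PPO's ratio clipping or TRPO's hard trust-region constraint, the step size t_k directly weights the KL term, which is the defining feature of the mirror-descent view the paper takes.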
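
Similarly, the continuous-control domains listed under Open Datasets can be instantiated directly through the Gym registry. The snippet below uses the registered v2 MuJoCo environment IDs; a classic Gym API with a four-tuple `step` return is assumed, matching Gym versions contemporary with the paper.

```python
import gym

# MuJoCo-based continuous-control tasks named in the paper.
ENV_IDS = ["Hopper-v2", "Walker2d-v2", "HalfCheetah-v2",
           "Ant-v2", "Humanoid-v2", "HumanoidStandup-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    # One random-action step just to confirm the environment loads and runs.
    obs, reward, done, info = env.step(env.action_space.sample())
    env.close()
```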