Mirror Descent Policy Optimization
Authors: Manan Tomar, Lior Shani, Yonathan Efroni, Mohammad Ghavamzadeh
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically evaluate our on-policy and off-policy MDPO algorithms on a number of continuous control tasks from OpenAI Gym [7], and compare them with state-of-the-art baselines: TRPO, PPO, and SAC. We report all experimental details, including the hyper-parameter values used by the algorithms, in Appendix B. In the tabular results, both in the main paper and in Appendices E and F, we report the final training scores averaged over 5 runs and their 95% confidence intervals (CI). |
| Researcher Affiliation | Collaboration | Manan Tomar (University of Alberta, Amii) manan.tomar@gmail.com; Lior Shani (Technion, Israel) shanlior@gmail.com; Yonathan Efroni (Microsoft Research NYC) yefroni@microsoft.com; Mohammad Ghavamzadeh (Google Research) ghavamza@google.com |
| Pseudocode | Yes | Below we provide the pseudocodes for the two MDPO algorithms, on-policy and off-policy. Algorithm 1 On-Policy MDPO; Algorithm 2 Off-Policy MDPO; Algorithm 3 Off-Policy MDPO (Soft). (A hedged sketch of the on-policy surrogate appears below the table.) |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating the release of source code for the described methodology. |
| Open Datasets | Yes | We evaluate all algorithms on OpenAI Gym [7] based continuous control tasks, including Hopper-v2, Walker2d-v2, Half Cheetah-v2, Ant-v2, Humanoid-v2 and Humanoid Standup-v2... We also compare on-policy MDPO and PPO on 21 Atari games from the ALE benchmark [5]. (An environment-loading snippet follows the table.) |
| Dataset Splits | No | The paper does not provide explicit dataset splits for training, validation, and testing in the traditional supervised learning sense. In reinforcement learning, data is often generated dynamically through environment interaction rather than being statically split. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'TensorFlow' or 'PyTorch' versions) for reproducibility. |
| Experiment Setup | Yes | We report all experimental details, including the hyper-parameter values used by the algorithms, in Appendix B... Table 2: Hyper-parameters of all on-policy methods. Table 3: Hyper-parameters of all off-policy methods. Table 4: Bregman stepsize for each domain, used by off-policy MDPO. |
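
The pseudocode entry above names Algorithms 1–3 without reproducing them. For context, the on-policy MDPO update described in the paper takes several SGD steps per iteration on an importance-weighted advantage term regularized by the KL divergence to the current anchor policy, scaled by 1/t_k. The PyTorch sketch below illustrates that surrogate for a diagonal-Gaussian policy; it is a minimal reading of the update, not the authors' implementation, and all names (`mdpo_on_policy_loss`, `gaussian_kl`, `t_k`) are illustrative.

```python
import torch

def gaussian_kl(mu_new, logstd_new, mu_old, logstd_old):
    """Closed-form KL(pi_theta(.|s) || pi_theta_k(.|s)) for diagonal Gaussian policies."""
    var_new = torch.exp(2.0 * logstd_new)
    var_old = torch.exp(2.0 * logstd_old)
    kl = logstd_old - logstd_new + (var_new + (mu_new - mu_old) ** 2) / (2.0 * var_old) - 0.5
    return kl.sum(dim=-1)  # sum over action dimensions

def mdpo_on_policy_loss(logp_new, logp_old, advantages, kl_to_anchor, t_k):
    """KL-regularized surrogate maximized for several SGD steps at MDPO iteration k (sketch).

    logp_new:     log pi_theta(a|s) under the policy being optimized
    logp_old:     log pi_theta_k(a|s) under the fixed anchor policy (treated as constant)
    advantages:   advantage estimates A^{pi_theta_k}(s, a)
    kl_to_anchor: per-state KL(pi_theta(.|s) || pi_theta_k(.|s)), e.g. from gaussian_kl
    t_k:          Bregman step size for this iteration
    """
    ratio = torch.exp(logp_new - logp_old.detach())
    objective = ratio * advantages - kl_to_anchor / t_k
    return -objective.mean()  # negated so a gradient-descent optimizer maximizes the objective
```

Unlike PPO's ratio clipping or TRPO's hard trust-region constraint, the step size t_k directly weights the KL term, which is the defining feature of the mirror-descent view the paper takes.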
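
Similarly, the continuous-control domains listed under Open Datasets can be instantiated directly through the Gym registry. The snippet below uses the registered v2 MuJoCo environment IDs; a classic Gym API with a four-tuple `step` return is assumed, matching Gym versions contemporary with the paper.

```python
import gym

# MuJoCo-based continuous-control tasks named in the paper.
ENV_IDS = ["Hopper-v2", "Walker2d-v2", "HalfCheetah-v2",
           "Ant-v2", "Humanoid-v2", "HumanoidStandup-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    # One random-action step just to confirm the environment loads and runs.
    obs, reward, done, info = env.step(env.action_space.sample())
    env.close()
```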