Monotonic Robust Policy Optimization with Model Discrepancy

Authors: Yuankun Jiang, Chenglin Li, Wenrui Dai, Junni Zou, Hongkai Xiong

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluations in several robot control tasks demonstrate that MRPO can generally improve both the average and worst-case performance in the source environments used for training, and in all cases equips the learned policy with better generalization capability in unseen testing environments. The evaluation covers six robot control benchmarks designed to test generalization under changeable dynamics.
Researcher Affiliation | Academia | Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China.
Pseudocode | Yes | Algorithm 1 (Monotonic Robust Policy Optimization) and Algorithm 2 (Practical Implementation of MRPO).
Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper mentions using the 'open-source robot control simulation environment, Roboschool (Schulman & Klimov, 2017)' and 'open-source generalization benchmarks (Packer et al., 2018)' to set up the environments, but it does not provide concrete access information (link, DOI, or specific citation) for a pre-existing dataset used for training, nor does it state that the generated data is made public. (See the environment sketch after the table.)
Dataset Splits | No | The paper discusses sampling environments for training and testing and evaluating performance, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or absolute counts) or cite pre-defined splits for reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions using 'Roboschool' and 'PPO' for implementation and policy optimization, but it does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We utilize two 64-unit hidden layers to construct the policy network and value function in PPO. For MRPO, we use the practical implementation as described in Algorithm 2. ... At each iteration k, we generate trajectories from M = 100 environments sampled according to a uniform distribution U. Referring to Appendix A.7, we sample L = 1 trajectory for each environment... Referring to Appendix A.11, we set α = 10% for PW-DR... In Algorithm 2, when we update the sampling distribution P for policy optimization, κ is a hyperparameter... (Hedged code sketches of the environment setup and of this network/sampling configuration follow the table.)
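
The table notes that the environments were built on the open-source Roboschool simulator with changeable dynamics, but no setup code is given. The sketch below is illustrative only: the environment IDs, the perturbation range, and the sample_env helper are assumptions, not details taken from the paper.

```python
# Illustrative only: instantiating Roboschool environments through Gym and drawing a
# dynamics-perturbation factor, mimicking the "changeable dynamics" training setup.
# The environment IDs, the uniform range, and the sample_env helper are assumptions.
import gym
import numpy as np
import roboschool  # noqa: F401  (importing registers the Roboschool* env IDs with Gym)

ENV_IDS = [
    "RoboschoolHopper-v1",
    "RoboschoolHalfCheetah-v1",
    "RoboschoolWalker2d-v1",
]  # illustrative subset; the paper evaluates six robot control benchmarks


def sample_env(rng: np.random.Generator):
    """Pick an environment and draw a dynamics scale factor (hypothetical helper)."""
    env_id = str(rng.choice(ENV_IDS))
    env = gym.make(env_id)
    # Draw a perturbation factor from an assumed uniform range; actually applying it
    # to simulator parameters is environment-specific and omitted here.
    dynamics_scale = rng.uniform(0.5, 1.5)
    return env, dynamics_scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    env, scale = sample_env(rng)
    obs = env.reset()
    print(env.spec.id, scale, obs.shape)
```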
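
The Experiment Setup row quotes concrete hyperparameters: two 64-unit hidden layers for the policy and value networks, M = 100 environments sampled per iteration from a uniform distribution, L = 1 trajectory per environment, and α = 10% for PW-DR. The following is a minimal sketch of how those pieces might look; the tanh activations, the Gaussian policy head, the parameter bounds, and the placeholder observation/action dimensions are assumptions not stated in the quoted text.

```python
# Minimal sketch of the quoted setup: two 64-unit hidden layers for the PPO policy and
# value networks, and per-iteration sampling of M = 100 environment parameter vectors
# from a uniform distribution with L = 1 trajectory each. The tanh activations, the
# Gaussian policy head, and the parameter bounds are assumptions not given in the table.
import numpy as np
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Gaussian policy with two 64-unit hidden layers, as stated in the setup."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
        )
        self.mean = nn.Linear(64, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor):
        h = self.body(obs)
        return self.mean(h), self.log_std.exp()


class ValueNet(nn.Module):
    """State-value function with the same two 64-unit hidden-layer structure."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


M = 100       # environments sampled per iteration (from the quoted setup)
L = 1         # trajectories per sampled environment (paper's Appendix A.7)
ALPHA = 0.10  # alpha the paper sets for PW-DR; its exact use is defined in Appendix A.11

rng = np.random.default_rng(0)
# Hypothetical bounds for two dynamics parameters; the paper samples from a uniform U.
param_low, param_high = np.array([0.5, 0.5]), np.array([1.5, 1.5])
env_params = rng.uniform(param_low, param_high, size=(M, len(param_low)))

policy = PolicyNet(obs_dim=17, act_dim=6)   # dimensions are placeholders
value_fn = ValueNet(obs_dim=17)
print(env_params.shape, sum(p.numel() for p in policy.parameters()))
```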