Monotonic Robust Policy Optimization with Model Discrepancy
Authors: Yuankun Jiang, Chenglin Li, Wenrui Dai, Junni Zou, Hongkai Xiong
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations in several robot control tasks demonstrate that MRPO can generally improve both the average and worst-case performance in the source environments for training, and facilitate in all cases the learned policy with a better generalization capability in some unseen testing environments. We now evaluate the proposed MRPO in six robot control benchmarks designed for evaluation of generalization under changeable dynamics. |
| Researcher Affiliation | Academia | (1) Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; (2) Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China. |
| Pseudocode | Yes | The paper provides Algorithm 1 (Monotonic Robust Policy Optimization) and Algorithm 2 (Practical Implementation of MRPO). |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper mentions using 'open-source robot control simulation environment, Roboschool (Schulman & Klimov, 2017)' and 'open-source generalization benchmarks (Packer et al., 2018)' to set up the environments, but it does not provide concrete access information (link, DOI, specific citation with authors/year) for a pre-existing dataset used for training or state that the generated data is made public. |
| Dataset Splits | No | The paper discusses sampling environments for training and testing and evaluating performance, but it does not specify explicit training/validation/test dataset splits (e.g., percentages or absolute counts) or cite pre-defined splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Roboschool' and 'PPO' for implementation and policy optimization, but it does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We utilize two 64-unit hidden layers to construct the policy network and value function in PPO. For MRPO, we use the practical implementation as described in Algorithm 2. ... At each iteration k, we generate trajectories from M = 100 environments sampled according to a uniform distribution U. Referring to Appendix A.7, we sample L = 1 trajectory for each environment... Referring to Appendix A.11, we set α = 10% for PW-DR... In Algorithm 2, when we update the sampling distribution P for policy optimization, κ is a hyperparameter... (A hedged sketch of this setup follows the table.) |
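
The experiment-setup row above reports the network size and environment-sampling parameters (two 64-unit hidden layers, M = 100 environments per iteration, L = 1 trajectory per environment, worst-case fraction α = 10%). Below is a minimal sketch of that setup under stated assumptions, not the authors' released code: it assumes PyTorch for the policy network, `evaluate_return` is a hypothetical placeholder for rolling out a single trajectory in a Roboschool environment with sampled dynamics parameters, and the dimensions and parameter ranges in the usage example are illustrative only.

```python
# A minimal sketch of the reported MRPO experiment setup (not the authors' code).
# Assumptions: PyTorch for the networks; `evaluate_return` is a hypothetical
# stand-in for rolling out one trajectory (L = 1) under the sampled dynamics
# parameters and returning its cumulative reward.

import numpy as np
import torch.nn as nn


class PolicyNet(nn.Module):
    """Policy with two 64-unit hidden layers, as described in the experiment setup."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )

    def forward(self, obs):
        return self.body(obs)


def evaluate_return(policy, env_param):
    # Hypothetical placeholder: roll out one trajectory with dynamics `env_param`
    # and return its cumulative reward. Replaced here by a dummy value.
    return float(np.random.randn())


def sample_worst_case_batch(policy, param_low, param_high, M=100, alpha=0.10):
    """Sample M environment parameters uniformly and keep the worst alpha fraction."""
    params = np.random.uniform(param_low, param_high, size=(M, len(param_low)))
    returns = np.array([evaluate_return(policy, p) for p in params])
    k = max(1, int(alpha * M))            # alpha = 10% worst-case environments
    worst_idx = np.argsort(returns)[:k]   # lowest-return environments
    return params[worst_idx], returns[worst_idx]


if __name__ == "__main__":
    # Illustrative dimensions and dynamics-parameter ranges only.
    policy = PolicyNet(obs_dim=8, act_dim=2)
    worst_params, worst_returns = sample_worst_case_batch(
        policy, param_low=[0.5, 0.5], param_high=[2.0, 2.0])
    print(worst_params.shape, worst_returns)
```

Keeping only the lowest-return 10% of sampled environments mirrors the worst-case emphasis in the reported setup; in MRPO proper, these returns would then feed the update of the sampling distribution P (with hyperparameter κ) and the subsequent PPO policy update.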