Learning to Constrain Policy Optimization with Virtual Trust Region
Authors: Thai Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, Svetha Venkatesh
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our proposed MCPO through a diverse set of experiments and compare our performance with that of recent constrained policy optimization baselines. In our experiment on classical control tasks, amongst tested models, MCPO consistently achieves better performance across tasks and hyperparameters. Our testbed on 6 Mujoco tasks shows that MCPO with a big policy memory is performant where the attention network plays an important role. We also demonstrate MCPO's capability of learning efficiently on sparse reward and high-dimensional problems such as navigation and Atari games. Finally, our ablation study highlights the necessity of MCPO's components such as the virtual policy and the attention network. |
| Researcher Affiliation | Academia | Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, Svetha Venkatesh Applied AI Institute, Deakin University, Geelong, Australia thai.le@deakin.edu.au |
| Pseudocode | Yes | Algorithm 1 Memory-Constrained Policy Optimization. |
| Open Source Code | Yes | Our code is available at https://github.com/thaihungle/MCPO. |
| Open Datasets | Yes | Here, we validate our method on sparse reward environments using Mini Grid library [4]. In particular, we test MCPO and other baselines (same as above) on Unlock and Unlock Pickup tasks. |
| Dataset Splits | No | The paper discusses training steps and evaluation but does not specify explicit train/validation/test splits of static datasets with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | For MCPO, we fix β_max = 10, β_min = 0.01 and only tune N. More details on the baselines and tasks are given in Appendix B.1. ... train all models on Unlock (find key and open the door) and Unlock Pickup (find key, open the door and pickup an object), for only 100,000 and 1 million environment steps, respectively. ... We pick 6 hard Mujoco tasks and train each model for 10 million environment steps. ... we train all models for only 10 million environment steps. |
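
For context on the pseudocode and experiment-setup rows above, the following is a minimal, hypothetical sketch of how the quoted ingredients could fit together: a memory of N past policies combined by attention weights into a "virtual" policy, and a KL penalty toward that virtual policy whose coefficient is clamped to [β_min, β_max] = [0.01, 10]. The function names, the softmax attention scoring, and the simple mixture of stored policies are illustrative assumptions, not the authors' Algorithm 1 or their released implementation (see https://github.com/thaihungle/MCPO for the actual code).

```python
# Illustrative sketch only; not the authors' MCPO implementation.
import numpy as np

BETA_MIN, BETA_MAX = 0.01, 10.0   # fixed in the paper; only N is tuned
N = 5                              # policy memory size (assumed value)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def virtual_policy(memory, scores):
    """Attention-weighted mixture of the N stored policy distributions."""
    weights = softmax(scores)                      # attention over memory slots
    return np.einsum("n,na->a", weights, memory)   # mixture over actions


def kl(p, q, eps=1e-8):
    """KL divergence between two discrete action distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def penalized_objective(surrogate, current, memory, scores, beta):
    """Surrogate RL objective minus the virtual-trust-region KL penalty."""
    beta = float(np.clip(beta, BETA_MIN, BETA_MAX))
    return surrogate - beta * kl(current, virtual_policy(memory, scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = rng.dirichlet(np.ones(4), size=N)   # N past policies over 4 actions
    current = rng.dirichlet(np.ones(4))          # current policy
    scores = rng.normal(size=N)                  # stand-in for learned attention scores
    print(penalized_objective(surrogate=1.0, current=current,
                              memory=memory, scores=scores, beta=0.5))
```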