Learning to Constrain Policy Optimization with Virtual Trust Region

Authors: Thai Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, Svetha Venkatesh

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify our proposed MCPO through a diverse set of experiments and compare our performance with that of recent constrained policy optimization baselines. In our experiment on classical control tasks, amongst tested models, MCPO consistently achieves better performance across tasks and hyperparameters. Our testbed on 6 Mujoco tasks shows that MCPO with a big policy memory is performant where the attention network plays an important role. We also demonstrate MCPO's capability of learning efficiently on sparse reward and high-dimensional problems such as navigation and Atari games. Finally, our ablation study highlights the necessity of MCPO's components such as the virtual policy and the attention network.
Researcher Affiliation | Academia | Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, Svetha Venkatesh; Applied AI Institute, Deakin University, Geelong, Australia; thai.le@deakin.edu.au
Pseudocode | Yes | Algorithm 1: Memory-Constrained Policy Optimization.
Open Source Code | Yes | Our code is available at https://github.com/thaihungle/MCPO.
Open Datasets | Yes | Here, we validate our method on sparse reward environments using MiniGrid library [4]. In particular, we test MCPO and other baselines (same as above) on Unlock and Unlock Pickup tasks.
Dataset Splits | No | The paper discusses training steps and evaluation but does not specify explicit train/validation/test splits of static datasets with percentages or sample counts.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | For MCPO, we fix β_max = 10, β_min = 0.01 and only tune N. More details on the baselines and tasks are given in Appendix B.1. ... train all models on Unlock (find key and open the door) and Unlock Pickup (find key, open the door and pickup an object), for only 100,000 and 1 million environment steps, respectively. ... We pick 6 hard Mujoco tasks and train each model for 10 million environment steps. ... we train all models for only 10 million environment steps.
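
The rows above outline the method and setup: Algorithm 1 (Memory-Constrained Policy Optimization) constrains updates with a virtual trust region built from a memory of past policies combined by an attention network, and the experiments fix β_max = 10, β_min = 0.01 while tuning only the memory size N. The snippet below is a minimal, illustrative sketch of that idea and not the authors' implementation (the released code at the GitHub link above is authoritative); the blending scheme, the β-adaptation rule, and all function names are assumptions made for illustration.

```python
# Illustrative sketch only -- not the authors' MCPO implementation (see the repo above).
# Assumed structure: a policy-gradient surrogate penalised by the KL divergence to a
# "virtual" policy blended from N remembered policies, with an adaptive coefficient
# beta kept inside the reported [beta_min, beta_max] range.
import numpy as np

BETA_MIN, BETA_MAX = 0.01, 10.0   # bounds reported in the Experiment Setup row
N = 5                             # policy-memory size (tuned per task in the paper)

def kl_categorical(p, q, eps=1e-8):
    """KL(p || q) for batched categorical action distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def virtual_policy(memory_probs, attention_weights):
    """Blend the N stored policies' action probabilities into one virtual policy.
    memory_probs: (N, batch, actions); attention_weights: (N,), summing to 1."""
    w = attention_weights / attention_weights.sum()
    return np.tensordot(w, memory_probs, axes=1)

def constrained_objective(ratio, advantage, cur_probs, memory_probs, attn, beta):
    """Policy-gradient surrogate minus a beta-weighted KL toward the virtual policy."""
    virt = virtual_policy(memory_probs, attn)
    kl = kl_categorical(cur_probs, virt).mean()
    return (ratio * advantage).mean() - beta * kl, kl

def adapt_beta(beta, kl, target_kl=0.01):
    """Assumed PPO-style adaptation: grow beta when the observed KL is too large,
    shrink it when too small, and always clamp to [BETA_MIN, BETA_MAX]."""
    beta = beta * 2.0 if kl > 1.5 * target_kl else beta / 2.0 if kl < target_kl / 1.5 else beta
    return float(np.clip(beta, BETA_MIN, BETA_MAX))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cur = rng.dirichlet(np.ones(4), size=32)        # current policy probs (batch=32, 4 actions)
    mem = rng.dirichlet(np.ones(4), size=(N, 32))   # N remembered policies
    attn = rng.dirichlet(np.ones(N))                # attention over memory slots
    ratio, adv = np.ones(32), rng.normal(size=32)
    beta = 1.0
    obj, kl = constrained_objective(ratio, adv, cur, mem, attn, beta)
    beta = adapt_beta(beta, kl)
    print(f"objective={obj:.3f}  kl={kl:.4f}  next beta={beta}")
```

Whatever adaptation rule is used, the final clamp keeps the penalty coefficient inside the reported [β_min, β_max] range, which is the only part of this sketch taken directly from the paper's stated setup.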