Learning to Constrain Policy Optimization with Virtual Trust Region
Authors: Thai Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, Svetha Venkatesh
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify our proposed MCPO through a diverse set of experiments and compare our performance with that of recent constrained policy optimization baselines. In our experiment on classical control tasks, amongst tested models, MCPO consistently achieves better performance across tasks and hyperparameters. Our testbed on 6 Mujoco tasks shows that MCPO with a big policy memory is performant where the attention network plays an important role. We also demonstrate MCPO's capability of learning efficiently on sparse reward and high-dimensional problems such as navigation and Atari games. Finally, our ablation study highlights the necessity of MCPO's components such as the virtual policy and the attention network. |
| Researcher Affiliation | Academia | Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, Svetha Venkatesh Applied AI Institute, Deakin University, Geelong, Australia thai.le@deakin.edu.au |
| Pseudocode | Yes | Algorithm 1 Memory-Constrained Policy Optimization. |
| Open Source Code | Yes | Our code is available at https://github.com/thaihungle/MCPO. |
| Open Datasets | Yes | Here, we validate our method on sparse reward environments using Mini Grid library [4]. In particular, we test MCPO and other baselines (same as above) on Unlock and Unlock Pickup tasks. |
| Dataset Splits | No | The paper discusses training steps and evaluation but does not specify explicit train/validation/test splits of static datasets with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | For MCPO, we fix β_max = 10, β_min = 0.01 and only tune N. More details on the baselines and tasks are given in Appendix B.1. ... train all models on Unlock (find key and open the door) and Unlock Pickup (find key, open the door and pickup an object), for only 100,000 and 1 million environment steps, respectively. ... We pick 6 hard Mujoco tasks and train each model for 10 million environment steps. ... we train all models for only 10 million environment steps. |
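
For context on the pseudocode and experiment-setup rows above, the following is a minimal, hypothetical sketch of how the quoted ingredients could fit together: a memory of N past policies combined by attention weights into a "virtual" policy, and a KL penalty toward that virtual policy whose coefficient is clamped to [β_min, β_max] = [0.01, 10]. The function names, the softmax attention scoring, and the simple mixture of stored policies are illustrative assumptions, not the authors' Algorithm 1 or their released implementation (see https://github.com/thaihungle/MCPO for the actual code).

```python
# Illustrative sketch only; not the authors' MCPO implementation.
import numpy as np

BETA_MIN, BETA_MAX = 0.01, 10.0   # fixed in the paper; only N is tuned
N = 5                              # policy memory size (assumed value)


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def virtual_policy(memory, scores):
    """Attention-weighted mixture of the N stored policy distributions."""
    weights = softmax(scores)                      # attention over memory slots
    return np.einsum("n,na->a", weights, memory)   # mixture over actions


def kl(p, q, eps=1e-8):
    """KL divergence between two discrete action distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def penalized_objective(surrogate, current, memory, scores, beta):
    """Surrogate RL objective minus the virtual-trust-region KL penalty."""
    beta = float(np.clip(beta, BETA_MIN, BETA_MAX))
    return surrogate - beta * kl(current, virtual_policy(memory, scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = rng.dirichlet(np.ones(4), size=N)   # N past policies over 4 actions
    current = rng.dirichlet(np.ones(4))          # current policy
    scores = rng.normal(size=N)                  # stand-in for learned attention scores
    print(penalized_objective(surrogate=1.0, current=current,
                              memory=memory, scores=scores, beta=0.5))
```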