Reduced Policy Optimization for Continuous Control with Hard Constraints

Authors: Shutong Ding, Jingya Wang, Yali Du, Ye Shi

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on these benchmarks demonstrate the superiority of RPO in terms of both cumulative reward and constraint violation. We believe RPO, along with the new benchmarks, will open up new opportunities for applying RL to real-world problems with complex constraints.
Researcher Affiliation | Academia | Shutong Ding (ShanghaiTech University), Jingya Wang (ShanghaiTech University), Yali Du (King's College London), Ye Shi (ShanghaiTech University)
Pseudocode | Yes | Algorithm 1: Training Procedure of RPO; Algorithm 2: Generalized Reduced Gradient Algorithm; Algorithm 3: RPO-DDPG; Algorithm 4: RPO-SAC. (A hedged sketch of a GRG-style update follows the table.)
Open Source Code | Yes | Our code is available at: https://github.com/wadx2019/rpo.
Open Datasets | Yes | Specifically, our benchmarks are designed based on [12], with extra interfaces that return information about the hard constraints. The data on power demand and day-ahead electricity prices are from [1, 2]. (A hypothetical constraint-interface sketch follows the table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, and test sets, beyond the general concept of training within an RL environment.
Hardware Specification | Yes | We implemented our experiments on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory.
Software Dependencies | Yes | The implementations of the three safe RL algorithms in our experiments are based on omnisafe and safe-explorer, and recommended values are adopted for hyperparameters not mentioned in the following tables.
Experiment Setup | Yes | Parameter tables (Table 4, Table 5, and Table 6) list hyperparameters such as batch size, discount factor, learning rates for the policy and value networks, temperature, and max GRG updates for each experiment. (An illustrative configuration sketch follows the table.)
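
The Pseudocode row above refers to Algorithm 2, a generalized reduced gradient (GRG) procedure. As a rough illustration of the general GRG idea only, not the authors' implementation (whose details are in the paper and repository), the sketch below performs one GRG-style update for equality constraints h(x) = 0: it splits the variables into basic and nonbasic blocks, descends along the reduced gradient in the nonbasic block, and restores feasibility with Newton iterations on the basic block. All function and variable names here are hypothetical.

```python
# Minimal, illustrative GRG-style update for equality constraints h(x) = 0.
# Names and signatures are hypothetical and NOT taken from the RPO code base.
import numpy as np

def grg_step(x, f_grad, h, h_jac, basic_idx, nonbasic_idx,
             step_size=1e-2, max_newton_iters=20, tol=1e-8):
    """Move the nonbasic (independent) variables along the reduced gradient of
    the objective, then restore h(x) = 0 by solving for the basic variables."""
    g = f_grad(x)                      # full objective gradient, shape (n,)
    J = h_jac(x)                       # constraint Jacobian, shape (m, n)
    J_B = J[:, basic_idx]              # square block w.r.t. basic variables
    J_N = J[:, nonbasic_idx]
    # Reduced gradient: grad_N f - J_N^T J_B^{-T} grad_B f
    reduced_grad = g[nonbasic_idx] - J_N.T @ np.linalg.solve(J_B.T, g[basic_idx])

    x_new = x.copy()
    x_new[nonbasic_idx] -= step_size * reduced_grad   # descend in independent vars

    # Restore feasibility by adjusting only the basic variables (Newton steps).
    for _ in range(max_newton_iters):
        residual = h(x_new)
        if np.linalg.norm(residual) < tol:
            break
        J_B = h_jac(x_new)[:, basic_idx]
        x_new[basic_idx] -= np.linalg.solve(J_B, residual)
    return x_new

if __name__ == "__main__":
    # Toy problem: minimize ||x - [2, 2]||^2 subject to x0 + x1 = 1.
    target = np.array([2.0, 2.0])
    f_grad = lambda x: 2.0 * (x - target)
    h = lambda x: np.array([x[0] + x[1] - 1.0])
    h_jac = lambda x: np.array([[1.0, 1.0]])
    x = np.array([1.0, 0.0])
    for _ in range(200):
        x = grg_step(x, f_grad, h, h_jac, basic_idx=[0], nonbasic_idx=[1])
    print(x)  # converges toward the constrained optimum, approximately [0.5, 0.5]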
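
The Open Datasets row mentions benchmarks with extra interfaces that expose hard-constraint information. The wrapper below is only a hypothetical illustration of what such an interface could look like for a Gym-style environment; the class name, constructor arguments, and info keys are assumptions, not the actual API of the RPO benchmarks.

```python
# Hypothetical sketch: attach hard-constraint evaluations to each step's info
# dict. This is NOT the RPO benchmark API, only an illustration of the idea.
import gymnasium as gym


class HardConstraintInfoWrapper(gym.Wrapper):
    """Report equality/inequality constraint values alongside each transition."""

    def __init__(self, env, eq_constraint_fn, ineq_constraint_fn):
        super().__init__(env)
        self._eq_fn = eq_constraint_fn      # h(obs, action) -> array, target: = 0
        self._ineq_fn = ineq_constraint_fn  # g(obs, action) -> array, target: <= 0

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        info["eq_constraints"] = self._eq_fn(obs, action)
        info["ineq_constraints"] = self._ineq_fn(obs, action)
        return obs, reward, terminated, truncated, info
```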
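
The Experiment Setup row lists the hyperparameter fields reported in Tables 4-6. The dictionary below merely mirrors those field names in code form; every value is an illustrative placeholder, not a number taken from the paper, so consult the tables or the repository for the actual settings.

```python
# Hypothetical configuration mirroring the hyperparameter fields in Tables 4-6.
# All VALUES are placeholders; see https://github.com/wadx2019/rpo for real ones.
rpo_config = {
    "batch_size": 256,          # placeholder
    "discount_factor": 0.99,    # placeholder
    "policy_lr": 3e-4,          # policy-network learning rate (placeholder)
    "value_lr": 3e-4,           # value-network learning rate (placeholder)
    "temperature": 0.2,         # entropy temperature, SAC variant (placeholder)
    "max_grg_updates": 10,      # cap on GRG iterations per action (placeholder)
}
```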