Gradient-Adaptive Pareto Optimization for Constrained Reinforcement Learning

Authors: Zixian Zhou, Mengda Huang, Feiyang Pan, Jia He, Xiang Ao, Dandan Tu, Qing He

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on the Safety Gym benchmarks show that our method consistently outperforms previous CRL methods in reward while satisfying the constraints."
Researcher Affiliation | Collaboration | "1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; 2 University of Chinese Academy of Sciences, Beijing 100049, China; 3 Huawei EI Innovation Lab"
Pseudocode | Yes | "We introduce how GCPO trains an agent and modifies t in a pseudo-code, which can be found in Appendix E."
Open Source Code | No | The paper does not provide an explicit statement about the release of open-source code, nor a link to a code repository for the method.
Open Datasets | Yes | "To validate our proposed algorithm, we conduct experiments on Safety Gym (Ray, Achiam, and Amodei 2019), a CRL benchmark."
Dataset Splits | No | The paper mentions "5 runs (1000 episodes each, 10000 steps each episode) with different random seeds" and "test runs" but does not specify a separate validation split.
Hardware Specification | No | The paper states that experiments are conducted in the Safety Gym environment simulated in MuJoCo, but it does not specify any details about the hardware (e.g., CPU, GPU, memory) used for simulation or training.
Software Dependencies | No | The paper mentions "Actor-critic-based PPO (Schulman et al. 2015)" as the base model of GCPO and MuJoCo (Todorov, Erez, and Tassa 2012), but it does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | "For reproducibility, a detailed statement about the architectures and hyper-parameters is presented in Appendix F."
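
Since the audit above identifies Safety Gym as the benchmark and an evaluation protocol of multiple seeded runs, the following is a minimal sketch of how one might roll out a policy on a Safety Gym task and log per-episode reward and constraint cost. It assumes the open-source `safety_gym` package (which registers its environments with OpenAI Gym) and uses a random policy as a placeholder, since the paper's GCPO implementation is not released; the environment name, seed count, and episode count here are illustrative, not the paper's exact configuration.

```python
# Minimal sketch: evaluate a policy on a Safety Gym task, tracking both
# episode reward and Safety Gym's constraint cost signal (info["cost"]).
# Assumes the open-source `safety_gym` package; the random policy is a
# placeholder for GCPO, and all counts below are illustrative.
import gym
import safety_gym  # noqa: F401 -- importing registers Safety-* envs with gym
import numpy as np


def evaluate_random_policy(env_name="Safety-PointGoal1-v0",
                           n_episodes=10, seed=0):
    """Run episodes; return per-episode (total reward, total cost) pairs."""
    env = gym.make(env_name)
    env.seed(seed)
    results = []
    for _ in range(n_episodes):
        obs = env.reset()
        ep_reward, ep_cost, done = 0.0, 0.0, False
        while not done:
            action = env.action_space.sample()  # placeholder policy
            obs, reward, done, info = env.step(action)
            ep_reward += reward
            ep_cost += info.get("cost", 0.0)  # constraint signal per step
        results.append((ep_reward, ep_cost))
    env.close()
    return results


if __name__ == "__main__":
    # Mirror the "runs with different random seeds" protocol in spirit.
    for seed in range(5):
        runs = evaluate_random_policy(seed=seed)
        rewards, costs = map(np.array, zip(*runs))
        print(f"seed {seed}: mean reward {rewards.mean():.2f}, "
              f"mean cost {costs.mean():.2f}")
```

A constrained RL method such as GCPO would be judged on both columns of this output at once: maximizing the reward while keeping the accumulated cost under the task's constraint threshold.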