Constraints Penalized Q-learning for Safe Offline Reinforcement Learning

Authors: Haoran Xu, Xianyuan Zhan, Xiangyu Zhu

AAAI 2022, pp. 8753-8760 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present a theoretical analysis and demonstrate empirically that our approach can learn robustly across a variety of benchmark control tasks, outperforming several baselines. Through systematic experiments, we show that our algorithm can learn robustly to maximize rewards while successfully satisfying safety constraints, outperforming all baselines in benchmark continuous control tasks. We conducted experiments on three MuJoCo tasks: Hopper-v2, HalfCheetah-v2, and Walker2d-v2.
Researcher Affiliation | Collaboration | 1. School of Computer Science and Technology, Xidian University, Xi'an, China; 2. JD iCity, JD Technology, Beijing, China; 3. JD Intelligent Cities Research; 4. Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Pseudocode | Yes | The pseudo-code of CPQ is presented in Algorithm 1 (a hedged sketch of the core update appears after this table).
Open Source Code | No | The paper does not provide any explicit statement about releasing code or a link to a code repository for the described methodology.
Open Datasets | No | The paper states, 'For each environment, we collect data using a safe policy... The dataset is a mixture of 50% transitions collected by the safe policy and 50% collected by the unsafe policy. Each dataset contains 2e6 samples.' However, it does not provide concrete access information (e.g., a link or a specific citation for public availability) for this collected dataset (a sketch of the described mixing procedure appears after this table).
Dataset Splits | Yes | Each agent is trained for 0.5 million steps and evaluated on 10 evaluation episodes (held out from the training distribution) after every 5000 iterations (a sketch of this schedule appears after this table).
Hardware Specification | No | The paper does not specify any details regarding the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation.
Experiment Setup | Yes | Implementation details and hyperparameter choices can be found in Appendix B.
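
For readers without access to Algorithm 1, the following is a minimal sketch of the constraints-penalized update that gives CPQ its name: the reward critic's bootstrap target is kept only when the cost critic predicts that the next action stays under the cost limit. All names (`cpq_target`, `cost_limit`, the batch keys) and hyperparameter values are illustrative assumptions, not the authors' released code, and the sketch omits the paper's handling of out-of-distribution actions.

```python
import torch

def cpq_target(batch, reward_critic_target, cost_critic, policy,
               gamma=0.99, cost_limit=10.0):
    """Reward-critic target that bootstraps only through next actions
    the cost critic predicts to be safe (CPQ-style update)."""
    with torch.no_grad():
        next_action = policy(batch["next_obs"])                   # a' ~ pi(.|s')
        q_next = reward_critic_target(batch["next_obs"], next_action)
        qc_next = cost_critic(batch["next_obs"], next_action)
        # Indicator 1[Qc(s', a') <= l]: successors predicted to violate
        # the constraint contribute no bootstrap value.
        safe = (qc_next <= cost_limit).float()
        return batch["reward"] + gamma * (1.0 - batch["done"]) * safe * q_next
```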
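The 50/50 safe/unsafe dataset quoted in the "Open Datasets" row could be reconstructed along these lines. The policy objects, the `info.get("cost", 0.0)` convention, and the use of the classic Gym step API for the -v2 environments are assumptions, since the paper does not release the collected data.

```python
import numpy as np

def collect_mixed_dataset(env, safe_policy, unsafe_policy, size=int(2e6)):
    """Build `size` transitions, half from a safe policy and half from
    an unsafe one, matching the mixture described in the paper."""
    half = size // 2
    data = []
    for policy, n in [(safe_policy, half), (unsafe_policy, size - half)]:
        obs = env.reset()
        for _ in range(n):
            action = policy(obs)
            next_obs, reward, done, info = env.step(action)
            # Per-step safety cost; reading it from `info` is an assumption
            # about how the benchmark environment exposes costs.
            cost = info.get("cost", 0.0)
            data.append((obs, action, reward, cost, next_obs, done))
            obs = env.reset() if done else next_obs
    np.random.shuffle(data)
    return data
```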
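The schedule in the "Dataset Splits" row (0.5 million gradient steps, 10 held-out evaluation episodes after every 5000 iterations) maps onto a loop like the one below. The `agent.update`, `agent.act`, and `dataset.sample` interfaces are hypothetical stand-ins for whatever the authors' implementation uses.

```python
def evaluate(agent, env):
    """Return the undiscounted return of one evaluation episode."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(agent.act(obs))
        total += reward
    return total

def train_and_evaluate(agent, dataset, eval_env, total_steps=500_000,
                       eval_every=5_000, eval_episodes=10):
    """Offline training loop matching the reported schedule."""
    history = []
    for step in range(1, total_steps + 1):
        agent.update(dataset.sample())          # one offline gradient step
        if step % eval_every == 0:
            scores = [evaluate(agent, eval_env) for _ in range(eval_episodes)]
            history.append((step, sum(scores) / eval_episodes))
    return history
```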