Constraints Penalized Q-learning for Safe Offline Reinforcement Learning
Authors: Haoran Xu, Xianyuan Zhan, Xiangyu Zhu
Pages: 8753-8760
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a theoretical analysis and demonstrate empirically that our approach can learn robustly across a variety of benchmark control tasks, outperforming several baselines. Through systematic experiments, we show that our algorithm can learn robustly to maximize rewards while successfully satisfying safety constraints, outperform all baselines in benchmark continuous control tasks. We conducted experiments on three Mujoco tasks: Hopper-v2, HalfCheetah-v2 and Walker2d-v2. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Technology, Xidian University, Xi'an, China 2 JD iCity, JD Technology, Beijing, China 3 JD Intelligent Cities Research 4 Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China |
| Pseudocode | Yes | The pseudo-code of CPQ is presented in Algorithm 1. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper states, 'For each environment, we collect data using a safe policy... The dataset is a mixture of 50% transitions collected by the safe policy and 50% collected by the unsafe policy. Each dataset contains 2e6 samples.' However, it does not provide concrete access information (e.g., a link or specific citation for public availability) for this collected dataset. |
| Dataset Splits | Yes | Each agent is trained for 0.5 million steps and evaluated on 10 evaluation episodes (which were separate from the train distribution) after every 5000 iterations. |
| Hardware Specification | No | The paper does not specify any details regarding the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | Implementation details and hyperparameter choices can be found in Appendix B. |
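
The dataset mixture and evaluation schedule quoted in the table can be made concrete with a short sketch. It is not the authors' code: the helper names `mix_datasets` and `evaluation_steps` are hypothetical, sampling without replacement is an assumption (the paper excerpt does not say how the 50/50 mixture was drawn), and "iterations" is assumed to mean the same unit as training "steps".

```python
import random


def mix_datasets(safe, unsafe, total, seed=0):
    """Build a shuffled dataset with half its transitions from each source.

    Assumption: transitions are drawn without replacement; the paper excerpt
    only states a 50% / 50% mixture, not the sampling procedure.
    """
    rng = random.Random(seed)
    half = total // 2
    mixed = rng.sample(safe, half) + rng.sample(unsafe, total - half)
    rng.shuffle(mixed)
    return mixed


def evaluation_steps(total_steps=500_000, eval_every=5_000):
    """Return the training steps at which an evaluation round would run.

    Assumption: one evaluation round after every `eval_every` steps,
    including the final step.
    """
    return list(range(eval_every, total_steps + 1, eval_every))


if __name__ == "__main__":
    # Toy stand-ins for transitions; the real datasets hold 2e6 samples each.
    safe = [("safe", i) for i in range(10)]
    unsafe = [("unsafe", i) for i in range(10)]
    print(mix_datasets(safe, unsafe, total=10))

    steps = evaluation_steps()
    episodes_per_round = 10  # from the Dataset Splits row
    print(len(steps), "evaluation rounds,",
          len(steps) * episodes_per_round, "evaluation episodes per run")
```

Under these assumptions, 0.5 million steps with an evaluation every 5000 steps gives 100 evaluation rounds, or 1000 evaluation episodes per training run.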