Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Constraints Penalized Q-learning for Safe Offline Reinforcement Learning
Authors: Haoran Xu, Xianyuan Zhan, Xiangyu Zhu (pp. 8753-8760)
AAAI 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a theoretical analysis and demonstrate empirically that our approach can learn robustly across a variety of benchmark control tasks, outperforming several baselines. Through systematic experiments, we show that our algorithm can learn robustly to maximize rewards while successfully satisfying safety constraints, outperforming all baselines in benchmark continuous control tasks. We conducted experiments on three Mujoco tasks: Hopper-v2, HalfCheetah-v2, and Walker2d-v2. |
| Researcher Affiliation | Collaboration | 1 School of Computer Science and Technology, Xidian University, Xi'an, China 2 JD iCity, JD Technology, Beijing, China 3 JD Intelligent Cities Research 4 Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China |
| Pseudocode | Yes | The pseudo-code of CPQ is presented in Algorithm 1. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper states, 'For each environment, we collect data using a safe policy... The dataset is a mixture of 50% transitions collected by the safe policy and 50% collected by the unsafe policy. Each dataset contains 2e6 samples.' However, it does not provide concrete access information (e.g., a link or specific citation for public availability) for this collected dataset. |
| Dataset Splits | Yes | Each agent is trained for 0.5 million steps and evaluated on 10 evaluation episodes (held out from the training distribution) after every 5000 iterations |
| Hardware Specification | No | The paper does not specify any details regarding the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation. |
| Experiment Setup | Yes | Implementation details and hyperparameter choices can be found in Appendix B. |
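The dataset construction and evaluation schedule quoted above (a 50/50 mixture of safe- and unsafe-policy transitions totaling 2e6 samples; 0.5 million training steps with 10 evaluation episodes every 5000 iterations) can be sketched as follows. The function name `mix_datasets` and the flat array layout of transitions are illustrative assumptions, not details from the paper:

```python
import numpy as np

def mix_datasets(safe, unsafe, total=2_000_000, seed=0):
    """Build a mixed offline RL dataset: half the transitions sampled
    from a safe policy's buffer, half from an unsafe policy's buffer.
    Array layout (rows = transitions) is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    half = total // 2
    idx_safe = rng.integers(len(safe), size=half)      # sample with replacement
    idx_unsafe = rng.integers(len(unsafe), size=half)
    mixed = np.concatenate([safe[idx_safe], unsafe[idx_unsafe]])
    rng.shuffle(mixed)  # interleave the two sources
    return mixed

# Reported schedule: 0.5M training steps, with 10 held-out evaluation
# episodes run every 5000 iterations -> 100 evaluation points in total.
TOTAL_STEPS, EVAL_EVERY, EVAL_EPISODES = 500_000, 5_000, 10
n_evals = TOTAL_STEPS // EVAL_EVERY
```

Sampling with replacement keeps the sketch valid even when one source buffer holds fewer than `total // 2` transitions; the paper does not specify how its buffers were subsampled.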