Off-Policy Primal-Dual Safe Reinforcement Learning

Authors: Zifan Wu, Bo Tang, Qian Lin, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on benchmark tasks show that our method not only achieves an asymptotic performance comparable to state-of-the-art on-policy methods while using much fewer samples, but also significantly reduces constraint violation during training. Our code is available at https://github.com/ZifanWu/CAL. We evaluate our method on a real-world auto-bidding task under the semi-batch training paradigm (Matsushima et al., 2021), where the behavior policy is not allowed to update within each long-term data-collecting process. Results verify the effectiveness of our method in such scenarios by conservatively approaching the optimal policy. (A sketch of this semi-batch collection/update loop appears after the table.)
Researcher Affiliation | Collaboration | 1) Sun Yat-sen University, Guangzhou, China ({wuzf5,linq67}@mail2.sysu.edu.cn, yuchao3@mail.sysu.edu.cn); 2) Institute for Advanced Algorithms Research, Shanghai, China (tangb@iaar.ac.cn); 3) Meituan, Beijing, China ({maoshangqin,xieqianlong,wangxingxing04,wangdong07}@meituan.com)
Pseudocode | Yes | Appendix C (Pseudo Code): "The pseudo code of CAL is presented in Algorithm 1." (Algorithm 1: CAL)
Open Source Code | Yes | Our code is available at https://github.com/ZifanWu/CAL.
Open Datasets | Yes | We conduct our comparative evaluation on the Safety-Gym benchmark (Ray et al., 2019) and the velocity-constrained MuJoCo benchmark (Zhang et al., 2020).
Dataset Splits | No | The paper discusses 'training' and 'testing' phases for its experiments (see Figures 2 and 3 and the surrounding text on 'training cost' and 'test reward'), but it does not explicitly specify a validation split or its use in the experimental setup, either in the main text or in the appendices. The term 'validation' appears only in the context of the JSON schema, not in the paper's experimental description.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models or memory specifications.
Software Dependencies | No | The paper mentions using Adam for optimization and refers to various existing RL methods and frameworks (e.g., SAC, omnisafe), but it does not specify version numbers for any programming languages, libraries, or other software dependencies required to reproduce the experiments.
Experiment Setup | Yes | The network structures of the actor and the reward/cost critics of CAL are the same as those of the off-policy baselines, i.e., two hidden layers of 256 neurons with ReLU activation. The discount factor γ is set to 0.99 for both reward and cost value estimation. The networks are optimized with Adam, using a learning rate of 5e-4 for the cost critics and 3e-4 for the actor and the reward critics. The ensemble size E is set to 4 for MuJoCo tasks and 6 for Safety-Gym tasks. The conservatism parameter k is set to 0.5 for all tasks except PointPush1 (0.8). The convexity parameter c is set to 10 for all tasks except Ant (100), HalfCheetah (1000), and Humanoid (1000). The UTD ratio of CAL is set to 20 for all tasks except Humanoid (10) and HalfCheetah (40). (A minimal wiring sketch of these values follows the table.)
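
To make the quoted setup concrete, here is a minimal PyTorch sketch, not the authors' code, of how the reported architecture and optimizer settings could be wired up. The SAC-style Gaussian actor head, the placeholder dimensions obs_dim/act_dim, and all object names are assumptions; only the layer sizes, activation, learning rates, discount factor, ensemble sizes, and per-task overrides come from the table above.

```python
# Minimal sketch, assuming a PyTorch implementation and a SAC-style actor.
# Only the layer sizes, activation, learning rates, discount factor, ensemble
# sizes, and per-task overrides are taken from the reported setup;
# obs_dim/act_dim and all names are illustrative placeholders.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Two hidden layers of 256 units with ReLU activation, as reported.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 60, 12        # placeholder dimensions; task-dependent
ensemble_size = 4                # 4 for MuJoCo tasks, 6 for Safety-Gym tasks
gamma = 0.99                     # discount for both reward and cost returns

actor = mlp(obs_dim, 2 * act_dim)             # assumed Gaussian head (mean, log-std)
reward_critic = mlp(obs_dim + act_dim, 1)
cost_critics = nn.ModuleList(                 # ensemble of E cost critics
    [mlp(obs_dim + act_dim, 1) for _ in range(ensemble_size)]
)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
reward_critic_opt = torch.optim.Adam(reward_critic.parameters(), lr=3e-4)
cost_critic_opt = torch.optim.Adam(cost_critics.parameters(), lr=5e-4)

# Per-task overrides quoted in the setup row (defaults listed first).
CONSERVATISM_K = {"default": 0.5, "PointPush1": 0.8}
CONVEXITY_C = {"default": 10, "Ant": 100, "HalfCheetah": 1000, "Humanoid": 1000}
UTD_RATIO = {"default": 20, "Humanoid": 10, "HalfCheetah": 40}
```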
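
The semi-batch evaluation protocol quoted in the Research Type row can be summarized as follows. The sketch below is an interpretation of that paradigm (Matsushima et al., 2021), not code from the paper; env, agent, and replay_buffer are hypothetical placeholders, and the phase lengths are arbitrary.

```python
# Sketch of the semi-batch training paradigm described above: the behavior
# policy stays frozen for an entire long data-collection phase and the learner
# is only updated off-policy between phases. All objects and phase lengths
# here are illustrative assumptions, not the authors' implementation.
def semi_batch_training(env, agent, replay_buffer,
                        num_phases=10,
                        steps_per_phase=100_000,
                        updates_per_phase=50_000):
    for _ in range(num_phases):
        # 1) Freeze a snapshot of the current policy and collect a long batch.
        behavior_policy = agent.snapshot_policy()   # hypothetical helper: frozen copy
        obs = env.reset()
        for _ in range(steps_per_phase):
            action = behavior_policy(obs)
            next_obs, reward, cost, done = env.step(action)  # assumed CMDP-style env
            replay_buffer.add(obs, action, reward, cost, next_obs, done)
            obs = env.reset() if done else next_obs

        # 2) Between collection phases, perform many off-policy updates
        #    (e.g., the paper's primal-dual safe RL update) on the stored data.
        for _ in range(updates_per_phase):
            agent.update(replay_buffer.sample())
```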