Off-Policy Primal-Dual Safe Reinforcement Learning

Authors: Zifan Wu, Bo Tang, Qian Lin, Chao Yu, Shangqin Mao, Qianlong Xie, Xingxing Wang, Dong Wang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on benchmark tasks show that our method not only achieves an asymptotic performance comparable to state-of-the-art on-policy methods while using much fewer samples, but also significantly reduces constraint violation during training. Our code is available at https://github.com/ZifanWu/CAL. We evaluate our method on a real-world auto-bidding task under the semi-batch training paradigm (Matsushima et al., 2021), where the behavior policy is not allowed to update within each long-term data-collecting process. Results verify the effectiveness of our method in such scenarios by conservatively approaching the optimal policy. (A sketch of this semi-batch collection/update loop appears after the table.)
Researcher Affiliation | Collaboration | 1) Sun Yat-sen University, Guangzhou, China ({wuzf5,linq67}@mail2.sysu.edu.cn, yuchao3@mail.sysu.edu.cn); 2) Institute for Advanced Algorithms Research, Shanghai, China (tangb@iaar.ac.cn); 3) Meituan, Beijing, China ({maoshangqin,xieqianlong,wangxingxing04,wangdong07}@meituan.com)
Pseudocode | Yes | Appendix C (Pseudo Code): "The pseudo code of CAL is presented in Algorithm 1." (Algorithm 1: CAL)
Open Source Code | Yes | Our code is available at https://github.com/ZifanWu/CAL.
Open Datasets | Yes | We conduct our comparative evaluation on the Safety-Gym benchmark (Ray et al., 2019) and the velocity-constrained MuJoCo benchmark (Zhang et al., 2020).
Dataset Splits | No | The paper discusses 'training' and 'testing' phases for its experiments (see Figures 2 and 3 and the surrounding text on 'training cost' and 'test reward'), but it does not explicitly specify a validation split or its use in the experimental setup, either in the main text or in the appendices. The term 'validation' appears only in the context of the JSON schema, not in the paper's experimental description.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models or memory specifications.
Software Dependencies | No | The paper mentions using Adam for optimization and refers to various existing RL methods and frameworks (e.g., SAC, omnisafe), but it does not specify version numbers for any programming languages, libraries, or other software dependencies required to reproduce the experiments.
Experiment Setup | Yes | The network structures of the actor and the reward/cost critics of CAL are the same as those of the off-policy baselines, i.e., two hidden layers of 256 neurons with ReLU activation. The discount factor γ is set to 0.99 for both reward and cost value estimation. The networks are optimized with Adam, using a learning rate of 5e-4 for the cost critics and 3e-4 for the actor and the reward critics. The ensemble size E is set to 4 for MuJoCo tasks and 6 for Safety-Gym tasks. The conservatism parameter k is set to 0.5 for all tasks except PointPush1 (0.8). The convexity parameter c is set to 10 for all tasks except Ant (100), HalfCheetah (1000), and Humanoid (1000). The UTD ratio of CAL is set to 20 for all tasks except Humanoid (10) and HalfCheetah (40). (A minimal wiring sketch of these values follows the table.)
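
To make the quoted setup concrete, here is a minimal PyTorch sketch, not the authors' code, of how the reported architecture and optimizer settings could be wired up. The SAC-style Gaussian actor head, the placeholder dimensions obs_dim/act_dim, and all object names are assumptions; only the layer sizes, activation, learning rates, discount factor, ensemble sizes, and per-task overrides come from the table above.

```python
# Minimal sketch, assuming a PyTorch implementation and a SAC-style actor.
# Only the layer sizes, activation, learning rates, discount factor, ensemble
# sizes, and per-task overrides are taken from the reported setup;
# obs_dim/act_dim and all names are illustrative placeholders.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Two hidden layers of 256 units with ReLU activation, as reported.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 60, 12        # placeholder dimensions; task-dependent
ensemble_size = 4                # 4 for MuJoCo tasks, 6 for Safety-Gym tasks
gamma = 0.99                     # discount for both reward and cost returns

actor = mlp(obs_dim, 2 * act_dim)             # assumed Gaussian head (mean, log-std)
reward_critic = mlp(obs_dim + act_dim, 1)
cost_critics = nn.ModuleList(                 # ensemble of E cost critics
    [mlp(obs_dim + act_dim, 1) for _ in range(ensemble_size)]
)

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
reward_critic_opt = torch.optim.Adam(reward_critic.parameters(), lr=3e-4)
cost_critic_opt = torch.optim.Adam(cost_critics.parameters(), lr=5e-4)

# Per-task overrides quoted in the setup row (defaults listed first).
CONSERVATISM_K = {"default": 0.5, "PointPush1": 0.8}
CONVEXITY_C = {"default": 10, "Ant": 100, "HalfCheetah": 1000, "Humanoid": 1000}
UTD_RATIO = {"default": 20, "Humanoid": 10, "HalfCheetah": 40}
```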
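
The semi-batch evaluation protocol quoted in the Research Type row can be summarized as follows. The sketch below is an interpretation of that paradigm (Matsushima et al., 2021), not code from the paper; env, agent, and replay_buffer are hypothetical placeholders, and the phase lengths are arbitrary.

```python
# Sketch of the semi-batch training paradigm described above: the behavior
# policy stays frozen for an entire long data-collection phase and the learner
# is only updated off-policy between phases. All objects and phase lengths
# here are illustrative assumptions, not the authors' implementation.
def semi_batch_training(env, agent, replay_buffer,
                        num_phases=10,
                        steps_per_phase=100_000,
                        updates_per_phase=50_000):
    for _ in range(num_phases):
        # 1) Freeze a snapshot of the current policy and collect a long batch.
        behavior_policy = agent.snapshot_policy()   # hypothetical helper: frozen copy
        obs = env.reset()
        for _ in range(steps_per_phase):
            action = behavior_policy(obs)
            next_obs, reward, cost, done = env.step(action)  # assumed CMDP-style env
            replay_buffer.add(obs, action, reward, cost, next_obs, done)
            obs = env.reset() if done else next_obs

        # 2) Between collection phases, perform many off-policy updates
        #    (e.g., the paper's primal-dual safe RL update) on the stored data.
        for _ in range(updates_per_phase):
            agent.update(replay_buffer.sample())
```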