Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Don’t Trade Off Safety: Diffusion Regularization for Constrained Offline RL

Authors: Junyu Guo, Zhi Zheng, Donghao Ying, Ming Jin, Shangding Gu, Costas J Spanos, Javad Lavaei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirical results show that DRCORL achieves reliable safety performance, fast inference, and strong reward outcomes across robot learning tasks. Compared to existing safe offline RL methods, it consistently meets cost limits and performs well with the same hyperparameters, indicating practical applicability in real-world scenarios. We evaluate DRCORL on the DSRL benchmark [Liu et al., 2023a], comparing against state-of-the-art offline safe RL methods. Experiments show that DRCORL consistently attains higher rewards while satisfying safety constraints. Table 1 summarizes the normalized accumulated reward and cost per episode across tasks. Overall, our algorithm consistently achieves high rewards while reliably maintaining safety constraints. Notably, our method significantly outperforms baselines in tasks such as Car Goal1, Car Goal2, Point Goal1, Point Goal2, and Ball Circle, ensuring both optimal reward and constraint satisfaction. We evaluate our algorithm's performance under varying cost limits l = 10, 20, and 30, analyzing the learned policies' behavior for each budget setting. The ablation results are presented in Figure 2 (b).
Researcher Affiliation Academia Junyu Guo University of California, Berkeley Zhi Zheng University of California, Berkeley Donghao Ying University of California, Berkeley Ming Jin Virginia Tech Shangding Gu University of California, Berkeley Costas Spanos University of California, Berkeley Javad Lavaei University of California, Berkeley
Pseudocode Yes With the procedures outlined above, we present our Algorithm 1, where the safe adaptation step is visualized in Figure 1 and detailed in Algorithm 2 given in Appendix B. Algorithm 1 DRCORL Algorithm 2 Gradient Manipulation Adaptation
Open Source Code No We open-source our implementation at https://github.com/JamesJunyuGuo/DRCORL. Justification: We will release and open-source the code and training logs after the review process.
Open Datasets Yes We evaluate our method on the offline safe RL benchmark DSRL Liu et al. [2023a]. We conduct extensive experiments on Safety-Gymnasium Ray et al. [2019] and Bullet Safety-Gym Gronauer [2022].
Dataset Splits No In the offline setting, the agent cannot interact directly with the environment and instead relies solely on a static dataset $D_\mu = \{(s_i, a_i, r_i, s'_i, c_i)\}_{i=1}^{N}$ consisting of multiple transition tuples, which is collected using a behavioral policy $\pi_b(a \mid s)$. Results are averaged over three cost limit scales, 20 evaluation episodes and 5 random seeds.
Hardware Specification No Table 2: Summary of hyperparameter configurations for different algorithms. Hyperparameters ... Device Cuda ... Justification: We indicate the type of compute workers and the amount of compute required for each individual experimental runs in the Appendix C. We also disclose that the full research project require the same compute as reported in the paper.
Software Dependencies No Algorithms can be implemented using the open-source benchmark OSRL Liu et al. [2024].
Experiment Setup Yes We present the general hyperparameter setting in Table 2. For hyperparameters that do not apply to the corresponding algorithm, we use the back slash symbol \ to fill the blank. Table 2: Summary of hyperparameter configurations for different algorithms. Batch Size 512 ... Update Steps 100000 ... Actor Architecture(MLP) [256,256] ... Critic Architecture(MLP) \ [256,256] ... Actor Learning rate .001 ... Critic Learning rate \ .001 ... Episode Length 1000 ... γ 1.00 ... τ .005 ... h+ \ \ \ \ \ \ \ \ .2 h \ \ \ \ \ \ \ \ .2 PID \ [.1,.003,.001] [.1,.003,.001] \ \ \ \ \ [.1,.003,.001] E \ \ \ \ \ \ \ \ 4 k \ \ \ \ \ \ \ \ 2.0 α \ \ \ \ \ \ \ \ .2
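
The Dataset Splits row states that results are averaged over three cost limit scales, 20 evaluation episodes and 5 random seeds. A minimal sketch of that kind of aggregation is given below. It is not the authors' evaluation code: the function name `aggregate`, the nested-dictionary layout, and the use of 10, 20 and 30 as the three cost limits (taken from the ablation description, which may not be the same three scales) are assumptions for illustration, and the numbers fed in are placeholders, not results from the paper.

```python
from statistics import mean

# Assumed for illustration: the three cost limits quoted in the ablation text.
COST_LIMITS = (10, 20, 30)
NUM_SEEDS = 5       # random seeds, as stated in the report
NUM_EPISODES = 20   # evaluation episodes per run, as stated in the report


def aggregate(values):
    """Average a nested mapping {cost_limit: {seed: [per-episode values]}}.

    Returns the mean per cost limit (episodes first, then seeds) and the
    overall mean across the cost limits.
    """
    per_limit = {}
    for limit in COST_LIMITS:
        seed_means = [mean(values[limit][seed]) for seed in range(NUM_SEEDS)]
        per_limit[limit] = mean(seed_means)
    overall = mean(list(per_limit.values()))
    return per_limit, overall


# Placeholder data: every episode's value simply equals its cost limit.
demo = {
    limit: {seed: [float(limit)] * NUM_EPISODES for seed in range(NUM_SEEDS)}
    for limit in COST_LIMITS
}
per_limit, overall = aggregate(demo)
print(per_limit)
print(overall)
```

Averaging episodes within a seed before averaging across seeds gives the same number as a flat mean here because every run has the same episode count; the nested form just makes the three levels named in the report explicit.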