Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Boundary-to-Region Supervision for Offline Safe Reinforcement Learning

Authors: Huikang Su, Dengyun Peng, Zifeng Zhuang, Yuhan Liu, Qiguang Chen, Donglin Wang, Qinghe Liu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks while achieving superior reward performance over baseline methods. This work highlights the limitations of symmetric token conditioning and establishes a new theoretical and practical approach for applying sequence models to safe RL.
Researcher Affiliation | Academia | Huikang Su, Dengyun Peng, Zifeng Zhuang, Yuhan Liu, Qiguang Chen, Donglin Wang, Qinghe Liu; Harbin Institute of Technology, Weihai; Harbin Institute of Technology, Harbin; Westlake University, Hangzhou
Pseudocode | Yes | Algorithm 1: Boundary-to-Region Framework
Open Source Code | Yes | Our code is available at https://github.com/HuikangSu/B2R.
Open Datasets | Yes | Experiments are conducted on the DSRL benchmark [25], which includes 38 sequential decision-making tasks of varying difficulty. This suite provides a diverse and realistic testbed for safety-critical offline RL. Full environment details are in Appendix C.2. ... Safety Gymnasium [29]: A suite of Mujoco-based environments designed for safe reinforcement learning... Bullet Safety Gym [11]: Built on the PyBullet physics engine... MetaDrive [22]: A self-driving simulator based on the Panda3D game engine...
Dataset Splits | Yes | To quantify the robustness of B2R to the sparsity of safe data, we conducted a data ablation study. We retrained B2R on subsets of the filtered safe dataset, sampled at 5%, 20%, 50%, and 100% of the originally available safe trajectories. The results on four representative tasks are shown in Table 4.
Hardware Specification | Yes | The experiments are conducted on a Linux-based server equipped with an Intel Core i9-14900K 32-Core Processor, one NVIDIA GeForce RTX 4070 GPU, and 64 GB of RAM.
Software Dependencies | Yes | The implementation is based on PyTorch (v1.13.1) with CUDA 12.4.
Experiment Setup | Yes | All models are trained for 20 epochs, each consisting of 5000 gradient steps, totaling 100,000 training steps. To evaluate robustness, we use three random seeds for all experiments. Environment-specific cost thresholds are listed in Table 7, along with other key hyperparameters. ... Table 7: Hyperparameter settings for B2R experiments (columns: Category, Hyperparameter, Value). Optimizer: Type = Lamb; Learning Rate = 0.0001; Batch Size = 2048; Gradient Clipping = 0.25.
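
The setup and data-ablation figures quoted above can be checked with a short sketch. This is not the authors' code: the class name `TrainConfig`, its field names, and the helper `subsample_trajectories` are hypothetical, and uniform sampling without replacement is an assumption, since the quoted text does not say how the subsets were drawn. Only the numeric values come from the table.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    epochs: int = 20
    steps_per_epoch: int = 5000
    num_seeds: int = 3
    optimizer: str = "Lamb"
    learning_rate: float = 1e-4
    batch_size: int = 2048
    grad_clip: float = 0.25

    @property
    def total_steps(self) -> int:
        # 20 epochs x 5000 gradient steps per epoch
        return self.epochs * self.steps_per_epoch


# Fractions of the filtered safe dataset quoted in the Dataset Splits row.
ABLATION_FRACTIONS = (0.05, 0.20, 0.50, 1.00)


def subsample_trajectories(trajectories, fraction, seed):
    """Return a random subset holding roughly `fraction` of the trajectories.

    Assumption: uniform sampling without replacement.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    items = list(trajectories)
    # Keep at least one trajectory, but never more than are available.
    k = min(len(items), max(1, round(len(items) * fraction)))
    rng = random.Random(seed)  # seeded so each subset is reproducible
    return rng.sample(items, k)


config = TrainConfig()

# Placeholder data: 200 integer ids stand in for safe trajectories.
subset_sizes = {
    fraction: len(subsample_trajectories(range(200), fraction, seed=0))
    for fraction in ABLATION_FRACTIONS
}
```

With 200 placeholder trajectories, the four fractions give subsets of 10, 40, 100, and 200 items, and `config.total_steps` matches the 100,000 training steps stated in the setup.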