Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Safety-Polarized and Prioritized Reinforcement Learning
Authors: Ke Fan, Jinpeng Zhang, Xuefeng Zhang, Yunze Wu, Jingyu Cao, Yuan Zhou, Jianzhu Ma
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on autonomous driving and safe control tasks, demonstrating that our proposed algorithms, SPOM and SPOM PER, achieve superior safety and the best reward-safety trade-off among state-of-the-art safe RL methods (Section 6). Section 6 is titled 'Experiments' and contains '6.1. Experiment Setup', '6.2. Main Results', and '6.3. Ablation Studies' with tables and figures of empirical results. |
| Researcher Affiliation | Academia | 1Department of Mathematical Sciences, Tsinghua University, Beijing, China 2Institute for Artificial Intelligence, Peking University, Beijing, China 3Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China 4Yau Mathematical Sciences Center, Tsinghua University, Beijing, China 5Beijing Institute of Mathematical Sciences and Applications, Beijing, China 6Department of Electronic Engineering, Tsinghua University, Beijing, China 7Institute for AI Industry Research, Tsinghua University, Beijing, China. Correspondence to: Yuan Zhou <EMAIL>, Jianzhu Ma <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Safety-Polarized Optimal action Masks with Prioritized Experience Replay (SPOM PER) |
| Open Source Code | Yes | Code for the experiments is available at https://github.com/FrankSinatral/Safety-PP.git. |
| Open Datasets | Yes | Our evaluation adopts the following four tasks: Two Way, Merge, Roundabout, and Intersection. These tasks are from the highway-env environment (Leurent, 2018; Leurent & Mercat, 2019), designed for simulated autonomous driving with diverse objectives that require intricate behaviors to safely achieve the corresponding goals. Additionally, our evaluation includes classical safe control tasks such as Adaptive Cruise Control (ACC) (Anderson et al., 2020) and Circle (Achiam et al., 2017). |
| Dataset Splits | No | The paper uses reinforcement learning environments (highway-env, ACC, Circle) rather than static datasets with explicit train/test/validation splits. It mentions training steps and evaluation metrics averaged over the last 1/10 training steps, but no specific dataset partitioning for reproduction is provided. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using algorithms like DQN and PPO, and environments like highway-env, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or specific environment versions). |
| Experiment Setup | Yes | Table 2. Hyperparameters for all DQN-based algorithms: optimizer Adam; discount factor 0.99; Q-network learning rate 5e-4; batch size 64; update every 1 step; initial ε-greedy exploration rate 1; ε decay 0.995; ε min 0.01; number of random seeds 6. SPOM (ours): SA-REF ψ learning rate 5e-4; polarization function f_pol(x) = 10 log(x). SPOM PER (ours): priority exponent α = 0.6; importance-sampling exponent θ = 0.4. Table 3. Hyperparameters for PPO-based algorithms: optimizer Adam; discount factor 0.99; actor and critic learning rates 3e-4; GAE parameter 0.97; clip ratio 0.2. RESPO: REF learning rate 1e-4; Lagrange-multiplier learning rate 5e-5. PPOLag: Lagrange-multiplier learning rate 0.001. |
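The SPOM PER exponents reported above (priority exponent 0.6, importance-sampling exponent 0.4) are the standard knobs of proportional prioritized experience replay. The paper's exact replay implementation is in its repository; the sketch below only illustrates, under the common proportional-PER formulation, how these two exponents enter the sampling probabilities and importance-sampling weights. Function names and the TD-error values are hypothetical.

```python
import numpy as np

# Exponents reported in Table 2 for SPOM PER (names per the paper):
# priority exponent alpha = 0.6, importance-sampling exponent theta = 0.4.
ALPHA, THETA = 0.6, 0.4

def sampling_probs(td_errors, alpha=ALPHA, eps=1e-6):
    """Proportional-PER sampling probabilities: p_i ~ (|delta_i| + eps)^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, theta=THETA):
    """IS correction w_i = (N * P(i))^(-theta), normalized by the max weight."""
    n = len(probs)
    w = (n * probs) ** (-theta)
    return w / w.max()

# Illustrative TD errors: the largest error gets the highest sampling
# probability and, correspondingly, the smallest IS weight.
td = np.array([0.5, 1.0, 0.1, 2.0])
p = sampling_probs(td)
w = importance_weights(p)
```

With theta below 1 (here 0.4), the correction only partially compensates for the non-uniform sampling, which is the usual trade-off between bias correction and gradient variance in PER.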