Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Safety-Polarized and Prioritized Reinforcement Learning
Authors: Ke Fan, Jinpeng Zhang, Xuefeng Zhang, Yunze Wu, Jingyu Cao, Yuan Zhou, Jianzhu Ma
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on autonomous driving and safe control tasks, demonstrating that our proposed algorithms, SPOM and SPOM PER, achieve superior safety and the best reward-safety trade-off among state-of-the-art safe RL methods (Section 6). Section 6 is titled 'Experiments' and contains '6.1. Experiment Setup', '6.2. Main Results', and '6.3. Ablation Studies' with tables and figures of empirical results. |
| Researcher Affiliation | Academia | 1Department of Mathematical Sciences, Tsinghua University, Beijing, China 2Institute for Artificial Intelligence, Peking University, Beijing, China 3Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China 4Yau Mathematical Sciences Center, Tsinghua University, Beijing, China 5Beijing Institute of Mathematical Sciences and Applications, Beijing, China 6Department of Electronic Engineering, Tsinghua University, Beijing, China 7Institute for AI Industry Research, Tsinghua University, Beijing, China. Correspondence to: Yuan Zhou <EMAIL>, Jianzhu Ma <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Safety-Polarized Optimal action Masks with Prioritized Experience Replay (SPOM PER) |
| Open Source Code | Yes | Code for the experiments is available at https://github.com/FrankSinatral/Safety-PP.git. |
| Open Datasets | Yes | Our evaluation adopts the following four tasks: Two Way, Merge, Roundabout, and Intersection. These tasks are from the highway-env environment (Leurent, 2018; Leurent & Mercat, 2019), designed for simulated autonomous driving with diverse objectives that require intricate behaviors to safely achieve the corresponding goals. Additionally, our evaluation includes classical safe control tasks such as Adaptive Cruise Control (ACC) (Anderson et al., 2020) and Circle (Achiam et al., 2017). |
| Dataset Splits | No | The paper uses reinforcement learning environments (highway-env, ACC, Circle) rather than static datasets with explicit train/test/validation splits. It mentions training steps and evaluation metrics averaged over the last 1/10 training steps, but no specific dataset partitioning for reproduction is provided. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using algorithms like DQN and PPO, and environments like highway-env, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, or specific environment versions). |
| Experiment Setup | Yes | Table 2. Hyperparameters for all DQN-based algorithms: optimizer Adam; discount factor 0.99; Q-network learning rate 5e-4; batch size 64; update every 1 step; initial ε-greedy exploration rate 1; ε decay 0.995; ε min 0.01; number of random seeds 6. SPOM (ours): SA-REF ψ learning rate 5e-4; polarization function f_pol(x) = 10 log(x). SPOM PER (ours): priority exponent α = 0.6; importance-sampling exponent θ = 0.4. Table 3. Hyperparameters for PPO-based algorithms: optimizer Adam; discount factor 0.99; actor and critic learning rates 3e-4; GAE parameter 0.97; clip ratio 0.2. RESPO: REF learning rate 1e-4; Lagrange-multiplier learning rate 5e-5. PPOLag: Lagrange-multiplier learning rate 0.001. |
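The SPOM PER exponents reported above (priority exponent 0.6, importance-sampling exponent 0.4) are the standard knobs of proportional prioritized experience replay. The paper's exact replay implementation is in its repository; the sketch below only illustrates, under the common proportional-PER formulation, how these two exponents enter the sampling probabilities and importance-sampling weights. Function names and the TD-error values are hypothetical.

```python
import numpy as np

# Exponents reported in Table 2 for SPOM PER (names per the paper):
# priority exponent alpha = 0.6, importance-sampling exponent theta = 0.4.
ALPHA, THETA = 0.6, 0.4

def sampling_probs(td_errors, alpha=ALPHA, eps=1e-6):
    """Proportional-PER sampling probabilities: p_i ~ (|delta_i| + eps)^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, theta=THETA):
    """IS correction w_i = (N * P(i))^(-theta), normalized by the max weight."""
    n = len(probs)
    w = (n * probs) ** (-theta)
    return w / w.max()

# Illustrative TD errors: the largest error gets the highest sampling
# probability and, correspondingly, the smallest IS weight.
td = np.array([0.5, 1.0, 0.1, 2.0])
p = sampling_probs(td)
w = importance_weights(p)
```

With theta below 1 (here 0.4), the correction only partially compensates for the non-uniform sampling, which is the usual trade-off between bias correction and gradient variance in PER.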