Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Exploring Safer Behaviors for Deep Reinforcement Learning

Authors: Enrico Marchesini, Davide Corsi, Alessandro Farinelli (pp. 7701-7709)

AAAI 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evidence on the Safety Gym benchmark shows that we successfully avoid drawbacks on the return while improving the safety of the policy. We compare a SOS implementation of PPO (Dhariwal et al. 2017) and TD3 (Fujimoto, van Hoof, and Meger 2018) over constrained approaches, namely CPO (Achiam et al. 2017), Lagrangian-PPO (Stooke, Achiam, and Abbeel 2020), and IPO (Liu, Ding, and Liu 2020), in the recent Safety Gym benchmarks (Ray, Achiam, and Amodei 2019).
Researcher Affiliation	Academia	Enrico Marchesini, Davide Corsi, Alessandro Farinelli Department of Computer Science, University of Verona EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1: Safety-Oriented Search
Open Source Code	No	The information is insufficient. The paper does not provide an explicit statement of open-sourcing their code or a link to a repository for the methodology described.
Open Datasets	Yes	in the recent Safety Gym benchmarks (Ray, Achiam, and Amodei 2019). We consider six tasks recommended by the authors of Safety Gym as a benchmark for our class of problems.
Dataset Splits	No	The information is insufficient. The paper does not provide specific dataset split information (exact percentages, sample counts, or citations to predefined splits) needed to reproduce data partitioning into training, validation, and test sets.
Hardware Specification	Yes	Data are collected on a RTX 2080, using the hyperparameters reported in the supplemental material.
Software Dependencies	No	The information is insufficient. The paper mentions various algorithms and a verification tool (Neurify) but does not provide specific version numbers for any general software dependencies or libraries.
Experiment Setup	No	The information is insufficient. The paper states that "Data are collected on a RTX 2080, using the hyperparameters reported in the supplemental material," indicating that specific experimental setup details such as hyperparameters are not included in the main text.