Saute RL: Almost Surely Safe Reinforcement Learning Using State Augmentation
Authors: Aivar Sootla, Alexander I. Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David H. Mguni, Jun Wang, Haitham Bou Ammar
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments. Environments. We demonstrate the advantages and the limitations of our approach on three OpenAI Gym environments with safety constraints (pendulum swing-up, double pendulum balancing, reacher) and the OpenAI Safety Gym environment (schematically depicted in Figure 1). ... Implementation. The main benefit of our approach to safe RL is the ability to sauté any RL algorithm. This is because we do not need to change the algorithm itself (besides some cosmetic changes), but create a wrapper around the environment (a hedged sketch of such a wrapper is given below the table). ... Evaluation protocols. In all our experiments we used 5 different seeds; we save the intermediate policies and evaluate them on 100 different trajectories in all our figures and tables. |
| Researcher Affiliation | Collaboration | 1 Huawei R&D UK. 2 Technische Universität Darmstadt. 3 University College London. 4 Honorary Lecturer at University College London. |
| Pseudocode | Yes | `def safety_step(self, cost: np.ndarray) -> np.ndarray:`<br>`    """Update the normalized safety state."""`<br>`    # subtract the normalized cost`<br>`    self._safe_state -= cost / self.safe_budget`<br>`    # normalize by the discount factor`<br>`    self._safe_state /= self.safe_discount_factor`<br>`    return self._safe_state` |
| Open Source Code | Yes | Our implementations are available online (Sootla et al., 2022). URL https://github.com/huawei-noah/HEBO/tree/master/SAUTE. |
| Open Datasets | Yes | We take the single pendulum swing-up from the classic control library in the OpenAI Gym (Brockman et al., 2016). We take the double pendulum stabilization implementation by (Todorov et al., 2012) using the OpenAI Gym (Brockman et al., 2016) interface (the environment InvertedDoublePendulumEnv from gym.envs.mujoco). We take the reacher implementation by (Todorov et al., 2012) using the OpenAI Gym (Brockman et al., 2016) interface (Reacher). |
| Dataset Splits | No | The paper discusses training and evaluating policies but does not provide specific details on dataset splits for training, validation, and testing as commonly understood in supervised learning. The data is generated through interaction with the environment. |
| Hardware Specification | No | The paper discusses the software frameworks and environments used (e.g., OpenAI Gym, MuJoCo) and various RL algorithms (PPO, TRPO, SAC, MBPO, PETS), but does not specify any details about the computational hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | Yes | We used safety starter agents (Ray et al., 2019) (TensorFlow == 1.13.1) as the core implementation for model-free methods, their Lagrangian versions (PPO, TRPO, SAC, CPO). We have tested some environments using the stable baselines library (Raffin et al., 2021) (PyTorch >= 1.8.1) and did not find drastic performance differences. |
| Experiment Setup | Yes | We take the default parameters presented in Tables A1 and A2. Common parameters: Network architecture: [64, 64]; Activation: tanh; Value function learning rate: 1e-3; Task discount factor: 0.99; Lambda: 0.97; N samples per epoch: 1000; N gradient steps: 80; Target KL: 0.01; Penalty learning rate: 5e-2; Safety discount factor: 0.99; Safety lambda: 0.97; Initial penalty: 1; Clip ratio: 0.2; Policy learning rate: 3e-4; Policy iterations: 80; KL margin: 1.2; Damping coefficient: 0.1; Backtrack coefficient: 0.8; Backtrack iterations: 10; Learning margin: False. (A hypothetical regrouping of these values as a configuration dictionary is given below the table.) |
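
The Implementation passage quoted in the Research Type row describes sautéing as a wrapper around the environment rather than a change to the RL algorithm, and the Pseudocode row quotes the paper's `safety_step` update. Below is a minimal, hedged Python sketch of how such a wrapper could look, assuming the old Gym 4-tuple `step` API and an environment that reports the per-step safety cost in `info["cost"]`. The class name, the `unsafe_reward` placeholder, and the reward-reshaping constant are illustrative rather than the authors' implementation; see the official repository linked above for the real one.

```python
import gym
import numpy as np


class SauteWrapperSketch(gym.Wrapper):
    """Illustrative wrapper: augments the observation with a normalized
    remaining-safety-budget coordinate (the 'safety state')."""

    def __init__(self, env, safe_budget: float,
                 safe_discount_factor: float = 0.99,
                 unsafe_reward: float = -200.0):  # placeholder penalty value
        super().__init__(env)
        self.safe_budget = safe_budget
        self.safe_discount_factor = safe_discount_factor
        self.unsafe_reward = unsafe_reward
        self._safe_state = 1.0
        # Extend the observation space by one coordinate for the safety state.
        low = np.append(env.observation_space.low, -np.inf)
        high = np.append(env.observation_space.high, np.inf)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)

    def safety_step(self, cost: np.ndarray) -> np.ndarray:
        """Update the normalized safety state (as in the paper's snippet)."""
        self._safe_state -= cost / self.safe_budget    # subtract the normalized cost
        self._safe_state /= self.safe_discount_factor  # normalize by the discount factor
        return self._safe_state

    def reset(self, **kwargs):
        self._safe_state = 1.0
        obs = self.env.reset(**kwargs)  # old Gym API: reset returns the observation only
        return np.append(obs, self._safe_state)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)  # old Gym API: 4-tuple
        cost = info.get("cost", 0.0)   # assumes the env reports a per-step safety cost
        safe_state = self.safety_step(cost)
        if safe_state <= 0.0:          # budget exhausted: replace the task reward
            reward = self.unsafe_reward
        return np.append(obs, safe_state), reward, done, info
```

The wrapped environment can then be passed unchanged to any standard PPO/TRPO/SAC implementation; the policy conditions on the remaining budget through the appended coordinate, which is what makes the approach algorithm-agnostic.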
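
For readability, the common hyperparameters listed in the Experiment Setup row can also be regrouped as a single configuration mapping. This is purely a restatement of the values quoted from Tables A1 and A2; the key names are hypothetical and do not necessarily match those used in the authors' code.

```python
# Values quoted from the paper's Tables A1/A2 (common parameters); key names are illustrative.
COMMON_HYPERPARAMETERS = {
    "network_architecture": [64, 64],
    "activation": "tanh",
    "value_function_learning_rate": 1e-3,
    "task_discount_factor": 0.99,
    "lambda": 0.97,
    "n_samples_per_epoch": 1000,
    "n_gradient_steps": 80,
    "target_kl": 0.01,
    "penalty_learning_rate": 5e-2,
    "safety_discount_factor": 0.99,
    "safety_lambda": 0.97,
    "initial_penalty": 1,
    "clip_ratio": 0.2,
    "policy_learning_rate": 3e-4,
    "policy_iterations": 80,
    "kl_margin": 1.2,
    "damping_coefficient": 0.1,
    "backtrack_coefficient": 0.8,
    "backtrack_iterations": 10,
    "learning_margin": False,
}
```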