Saute RL: Almost Surely Safe Reinforcement Learning Using State Augmentation
Authors: Aivar Sootla, Alexander I. Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David H. Mguni, Jun Wang, Haitham Bou Ammar
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experiments. Environments. We demonstrate the advantages and the limitations of our approach on three OpenAI Gym environments with safety constraints (pendulum swing-up, double pendulum balancing, reacher) and the OpenAI Safety Gym environment (schematically depicted in Figure 1). ... Implementation. The main benefit of our approach to safe RL is the ability to sauté any RL algorithm. This is because we do not need to change the algorithm itself (besides some cosmetic changes), but create a wrapper around the environment (a hedged sketch of such a wrapper is given below the table). ... Evaluation protocols. In all our experiments we used 5 different seeds; we save the intermediate policies and evaluate them on 100 different trajectories in all our figures and tables. |
| Researcher Affiliation | Collaboration | 1 Huawei R&D UK. 2 Technische Universität Darmstadt. 3 University College London. 4 Honorary Lecturer at University College London. |
| Pseudocode | Yes | `def safety_step(self, cost: np.ndarray) -> np.ndarray:`<br>`    """Update the normalized safety state."""`<br>`    # subtract the normalized cost`<br>`    self._safe_state -= cost / self.safe_budget`<br>`    # normalize by the discount factor`<br>`    self._safe_state /= self.safe_discount_factor`<br>`    return self._safe_state` |
| Open Source Code | Yes | Our implementations are available online (Sootla et al., 2022). URL https://github.com/huawei-noah/HEBO/tree/master/SAUTE. |
| Open Datasets | Yes | We take the single pendulum swing-up from the classic control library in the OpenAI Gym (Brockman et al., 2016). We take the double pendulum stabilization implementation by (Todorov et al., 2012) using the OpenAI Gym (Brockman et al., 2016) interface (the environment InvertedDoublePendulumEnv from gym.envs.mujoco). We take the reacher implementation by (Todorov et al., 2012) using the OpenAI Gym (Brockman et al., 2016) interface (Reacher). |
| Dataset Splits | No | The paper discusses training and evaluating policies but does not provide specific details on dataset splits for training, validation, and testing as commonly understood in supervised learning. The data is generated through interaction with the environment. |
| Hardware Specification | No | The paper discusses the software frameworks and environments used (e.g., OpenAI Gym, MuJoCo) and various RL algorithms (PPO, TRPO, SAC, MBPO, PETS), but does not specify any details about the computational hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | Yes | We used safety starter agents (Ray et al., 2019) (TensorFlow == 1.13.1) as the core implementation for model-free methods, their Lagrangian versions (PPO, TRPO, SAC, CPO). We have tested some environments using the stable baselines library (Raffin et al., 2021) (PyTorch >= 1.8.1) and did not find drastic performance differences. |
| Experiment Setup | Yes | We take the default parameters presented in Tables A1 and A2. Common parameters: Network architecture: [64, 64]; Activation: tanh; Value function learning rate: 1e-3; Task discount factor: 0.99; Lambda: 0.97; N samples per epoch: 1000; N gradient steps: 80; Target KL: 0.01; Penalty learning rate: 5e-2; Safety discount factor: 0.99; Safety lambda: 0.97; Initial penalty: 1; Clip ratio: 0.2; Policy learning rate: 3e-4; Policy iterations: 80; KL margin: 1.2; Damping coefficient: 0.1; Backtrack coefficient: 0.8; Backtrack iterations: 10; Learning margin: False. (A hypothetical regrouping of these values as a configuration dictionary is given below the table.) |
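
The Implementation passage quoted in the Research Type row describes sautéing as a wrapper around the environment rather than a change to the RL algorithm, and the Pseudocode row quotes the paper's `safety_step` update. Below is a minimal, hedged Python sketch of how such a wrapper could look, assuming the old Gym 4-tuple `step` API and an environment that reports the per-step safety cost in `info["cost"]`. The class name, the `unsafe_reward` placeholder, and the reward-reshaping constant are illustrative rather than the authors' implementation; see the official repository linked above for the real one.

```python
import gym
import numpy as np


class SauteWrapperSketch(gym.Wrapper):
    """Illustrative wrapper: augments the observation with a normalized
    remaining-safety-budget coordinate (the 'safety state')."""

    def __init__(self, env, safe_budget: float,
                 safe_discount_factor: float = 0.99,
                 unsafe_reward: float = -200.0):  # placeholder penalty value
        super().__init__(env)
        self.safe_budget = safe_budget
        self.safe_discount_factor = safe_discount_factor
        self.unsafe_reward = unsafe_reward
        self._safe_state = 1.0
        # Extend the observation space by one coordinate for the safety state.
        low = np.append(env.observation_space.low, -np.inf)
        high = np.append(env.observation_space.high, np.inf)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)

    def safety_step(self, cost: np.ndarray) -> np.ndarray:
        """Update the normalized safety state (as in the paper's snippet)."""
        self._safe_state -= cost / self.safe_budget    # subtract the normalized cost
        self._safe_state /= self.safe_discount_factor  # normalize by the discount factor
        return self._safe_state

    def reset(self, **kwargs):
        self._safe_state = 1.0
        obs = self.env.reset(**kwargs)  # old Gym API: reset returns the observation only
        return np.append(obs, self._safe_state)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)  # old Gym API: 4-tuple
        cost = info.get("cost", 0.0)   # assumes the env reports a per-step safety cost
        safe_state = self.safety_step(cost)
        if safe_state <= 0.0:          # budget exhausted: replace the task reward
            reward = self.unsafe_reward
        return np.append(obs, safe_state), reward, done, info
```

The wrapped environment can then be passed unchanged to any standard PPO/TRPO/SAC implementation; the policy conditions on the remaining budget through the appended coordinate, which is what makes the approach algorithm-agnostic.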
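
For readability, the common hyperparameters listed in the Experiment Setup row can also be regrouped as a single configuration mapping. This is purely a restatement of the values quoted from Tables A1 and A2; the key names are hypothetical and do not necessarily match those used in the authors' code.

```python
# Values quoted from the paper's Tables A1/A2 (common parameters); key names are illustrative.
COMMON_HYPERPARAMETERS = {
    "network_architecture": [64, 64],
    "activation": "tanh",
    "value_function_learning_rate": 1e-3,
    "task_discount_factor": 0.99,
    "lambda": 0.97,
    "n_samples_per_epoch": 1000,
    "n_gradient_steps": 80,
    "target_kl": 0.01,
    "penalty_learning_rate": 5e-2,
    "safety_discount_factor": 0.99,
    "safety_lambda": 0.97,
    "initial_penalty": 1,
    "clip_ratio": 0.2,
    "policy_learning_rate": 3e-4,
    "policy_iterations": 80,
    "kl_margin": 1.2,
    "damping_coefficient": 0.1,
    "backtrack_coefficient": 0.8,
    "backtrack_iterations": 10,
    "learning_margin": False,
}
```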