Safety through feedback in Constrained RL
Authors: Shashank Reddy Chirra, Pradeep Varakantham, Praveen Paruchuri
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the efficiency of our method through experimentation on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach. The code to replicate these results can be found at https://github.com/shshnkreddy/RLSF |
| Researcher Affiliation | Academia | Shashank Reddy Chirra¹, Pradeep Varakantham¹, Praveen Paruchuri² (¹Singapore Management University, ²IIIT Hyderabad) {shashankc,pradeepv}@smu.edu.sg, praveen.p@iiit.ac.in |
| Pseudocode | Yes | Algorithm 1 Reinforcement Learning from Safety Feedback (RLSF) |
| Open Source Code | Yes | The code to replicate these results can be found at https://github.com/shshnkreddy/RLSF |
| Open Datasets | Yes | We evaluate RLSF on multiple continuous control benchmarks in the Safety Gymnasium environment [17] and MuJoCo [30] based environments introduced in [22]. |
| Dataset Splits | No | The paper discusses 'train', 'validation', and 'test' in the context of policy and model evaluation, but does not provide specific data splits (e.g., percentages or counts) for any dataset used to allow reproduction of the data partitioning. |
| Hardware Specification | Yes | We conducted the experiments on a cluster equipped with 4 NVIDIA RTX A5000 GPUs and 96-core CPUs. |
| Software Dependencies | No | The paper mentions software components like 'PPO-Lagrangian algorithm [3]' and 'SimHash [10]' but does not provide specific version numbers for any software or libraries used in the implementation, which is necessary for reproducible software dependencies. |
| Experiment Setup | Yes | The detailed hyperparameters utilized in the experiments are presented in Table 3. All results presented in both the main paper and the appendix are based on three independent seeds, with the mean and standard error reported, unless explicitly stated otherwise. |
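The Experiment Setup row states that results are reported as the mean and standard error over three independent seeds. The sketch below shows one common way to compute that summary. It is a minimal illustration under an assumption: the standard error is taken as the sample standard deviation (n - 1 denominator) divided by the square root of the number of seeds, because the paper does not state which convention it uses. The function name `mean_and_standard_error` and the per-seed values are hypothetical and are not taken from the RLSF code or results.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(per_seed_values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) across independent seeds.

    Assumption: the standard error uses the sample standard deviation
    (n - 1 denominator); the paper does not specify its convention.
    """
    n = len(per_seed_values)
    if n < 2:
        # A single seed gives no estimate of spread.
        raise ValueError("need at least two seeds to estimate a standard error")
    mean = statistics.fmean(per_seed_values)
    sample_std = statistics.stdev(per_seed_values)  # sample std, n - 1 denominator
    return mean, sample_std / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical final returns from three seeds (illustrative, not paper results).
    returns = [10.0, 12.0, 14.0]
    m, se = mean_and_standard_error(returns)
    print(f"{m:.2f} ± {se:.2f}")  # prints "12.00 ± 1.15"
```

With only three seeds the standard error is itself a noisy estimate, so it is best read as a rough indication of seed-to-seed variability rather than a tight confidence bound.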