Safety through feedback in Constrained RL

Authors: Shashank Reddy Chirra, Pradeep Varakantham, Praveen Paruchuri

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase the efficiency of our method through experimentation on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach. The code to replicate these results can be found at https://github.com/shshnkreddy/RLSF
Researcher Affiliation | Academia | Shashank Reddy Chirra¹, Pradeep Varakantham¹, Praveen Paruchuri²; ¹Singapore Management University, ²IIIT Hyderabad; {shashankc,pradeepv}@smu.edu.sg, praveen.p@iiit.ac.in
Pseudocode | Yes | Algorithm 1 Reinforcement Learning from Safety Feedback (RLSF)
Open Source Code | Yes | The code to replicate these results can be found at https://github.com/shshnkreddy/RLSF
Open Datasets | Yes | We evaluate RLSF on multiple continuous control benchmarks in the Safety Gymnasium environment [17] and MuJoCo [30] based environments introduced in [22].
Dataset Splits | No | The paper discusses 'train', 'validation', and 'test' in the context of policy and model evaluation, but does not provide specific data splits (e.g., percentages or counts) for any dataset, so the data partitioning cannot be reproduced.
Hardware Specification | Yes | We conducted the experiments on a cluster equipped with 4 NVIDIA RTX A5000 GPUs and 96 core CPUs.
Software Dependencies | No | The paper mentions software components like 'PPO-Lagrangian algorithm [3]' and 'SimHash [10]' but does not provide version numbers for any software or libraries used in the implementation, which reproducing the software environment would require.
Experiment Setup | Yes | The detailed hyperparameters utilized in the experiments are presented in Table 3. All results presented in both the main paper and the appendix are based on three independent seeds, with the mean and standard error reported, unless explicitly stated otherwise.
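
The experiment setup reports results as a mean and standard error over three independent seeds. A minimal sketch of that aggregation follows. It assumes the standard error is the sample standard deviation (n - 1 denominator) divided by the square root of the number of seeds, which the summary above does not confirm. The function name and the per-seed values are hypothetical and are not taken from the paper or its repository.

```python
import math
from statistics import mean, stdev


def mean_and_standard_error(values):
    """Return (mean, standard error) for a list of per-seed results.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    return mean(values), stdev(values) / math.sqrt(n)


# Hypothetical final returns from three seeds (illustrative numbers only).
seed_returns = [10.0, 12.0, 14.0]
avg, se = mean_and_standard_error(seed_returns)
print(f"{avg:.2f} +/- {se:.2f}")
```

With only three seeds the standard error is a coarse estimate of run-to-run variability, so small differences between methods reported this way should be read with caution.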