Accelerating Safe Reinforcement Learning with Constraint-mismatched Baseline Policies
Authors: Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J Ramadge
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art baselines, achieving 10 times fewer constraint violations and 40% higher reward on average. |
| Researcher Affiliation | Collaboration | Princeton University; Siemens Corporation, Corporate Technology. |
| Pseudocode | Yes | Algorithm 1 SPACE |
| Open Source Code | Yes | Code is available at: https://sites.google.com/view/spacealgo |
| Open Datasets | No | The paper mentions tasks such as MuJoCo, real-world traffic management, and car-racing, and uses human demonstration data, but does not provide concrete access information (links, DOIs, formal citations) for any public datasets used for training. |
| Dataset Splits | No | The paper describes experimental tasks but does not provide specific details on dataset splits (e.g., percentages, sample counts, or explicit validation sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'neural networks to represent Gaussian policies' and different projection types but does not specify versions for any software dependencies or libraries. |
| Experiment Setup | Yes | For all the algorithms, we use neural networks to represent Gaussian policies. We use the KL-divergence projection in the MuJoCo and car-racing tasks, and the 2-norm projection in the traffic management task since it achieves better performance. We use a grid search to select the hyper-parameters. See the supplementary material for more experimental details. ... The weight is fixed and it is set to 1. ... The d-CPO update solves the f-CPO problem with a stateful λ_{k+1} = (λ_k)^β, where 0 < β < 1. (The weight schedule is illustrated in the sketch below the table.) |
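The experiment-setup excerpt quotes a stateful weight schedule λ_{k+1} = (λ_k)^β with 0 < β < 1. The snippet below is a minimal, hypothetical Python sketch of that schedule only; the function and variable names (`decay_weight`, `blend_with_baseline`, `lam`, `beta`) and the interpretation of the weight as an interpolation coefficient between new and baseline policy parameters are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of the stateful weight schedule lambda_{k+1} = (lambda_k)**beta
# quoted in the Experiment Setup row above. All names and the interpolation example
# are illustrative assumptions, not taken from the SPACE code release.
import numpy as np


def decay_weight(lam: float, beta: float) -> float:
    """One step of the stateful update lambda_{k+1} = (lambda_k)**beta, with 0 < beta < 1."""
    assert 0.0 < beta < 1.0, "beta must lie in (0, 1)"
    return lam ** beta


def blend_with_baseline(theta_new: np.ndarray,
                        theta_baseline: np.ndarray,
                        lam: float) -> np.ndarray:
    """Assumed illustration: (1 - lam) is the weight still placed on the baseline
    parameters, so the baseline's influence shrinks as lam grows toward 1."""
    return lam * theta_new + (1.0 - lam) * theta_baseline


if __name__ == "__main__":
    lam, beta = 0.5, 0.9          # hypothetical starting weight and decay exponent
    theta_new = np.ones(4)        # stand-in for freshly updated policy parameters
    theta_base = np.zeros(4)      # stand-in for the baseline policy's parameters
    for k in range(10):
        theta = blend_with_baseline(theta_new, theta_base, lam)
        print(f"iter {k}: lambda = {lam:.3f}, blended theta[0] = {theta[0]:.3f}")
        lam = decay_weight(lam, beta)  # for lam in (0, 1) and beta in (0, 1), repeated
                                       # exponentiation pushes lam monotonically toward 1
```

Under these assumptions, the weight on the baseline policy decays geometrically in log-space across iterations, which matches the excerpt's contrast between a fixed weight (set to 1) and the stateful d-CPO update.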