Accelerating Safe Reinforcement Learning with Constraint-mismatched Baseline Policies

Authors: Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J. Ramadge

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art baselines, achieving 10 times fewer constraint violations and 40% higher reward on average.
Researcher Affiliation | Collaboration | Princeton University; Siemens Corporation, Corporate Technology
Pseudocode | Yes | Algorithm 1 (SPACE)
Open Source Code | Yes | Code is available at: https://sites.google.com/view/spacealgo
Open Datasets | No | The paper mentions tasks such as Mujoco, real-world traffic management, and car racing, and uses human demonstration data, but does not provide concrete access information (links, DOIs, or formal citations) for any public dataset used for training.
Dataset Splits | No | The paper describes the experimental tasks but does not give specific dataset splits (e.g., percentages, sample counts, or an explicit validation set).
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using 'neural networks to represent Gaussian policies' and different projection types, but does not specify versions for any software dependencies or libraries.
Experiment Setup | Yes | For all the algorithms, we use neural networks to represent Gaussian policies. We use the KL-divergence projection in the Mujoco and car-racing tasks, and the 2-norm projection in the traffic management task since it achieves better performance. We use a grid search to select the hyper-parameters. See the supplementary material for more experimental details. ... The weight is fixed and it is set to 1. ... The d-CPO update solves the f-CPO problem with a stateful λ_{k+1} = (λ_k)^β, where 0 < β < 1.
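For concreteness, the sketch below illustrates one way the "2-norm projection" mentioned in the experiment-setup row can be realized: a closed-form Euclidean projection of the policy parameters onto a constraint half-space linearized around the previous iterate. This is a minimal illustration only; the function name and variables (theta, theta_k, a, b) are hypothetical placeholders, and the released SPACE code linked above should be treated as authoritative.

```python
import numpy as np

def project_halfspace_l2(theta, theta_k, a, b):
    """Euclidean (2-norm) projection of `theta` onto the half-space
    {x : b + a @ (x - theta_k) <= 0}, i.e. a constraint linearized
    around the previous iterate `theta_k`.
    Illustrative placeholder, not the released SPACE implementation.
    """
    violation = b + a @ (theta - theta_k)
    if violation <= 0:
        return theta                              # already feasible under the linear model
    return theta - (violation / (a @ a)) * a      # closed-form half-space projection

# Hypothetical usage: project a post-reward-step iterate back toward feasibility.
theta_k = np.zeros(3)                             # previous policy parameters
theta_half = np.array([1.0, 0.5, -0.2])           # parameters after a reward-improvement step
a = np.array([1.0, 0.0, 0.0])                     # constraint-cost gradient (placeholder)
b = 0.3                                           # current constraint violation (placeholder)
print(project_halfspace_l2(theta_half, theta_k, a, b))  # -> [-0.3  0.5 -0.2]
```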
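The stateful schedule λ_{k+1} = (λ_k)^β quoted for the d-CPO baseline can be generated as below. The initial value and β used here are placeholder values for illustration, not hyper-parameters reported in the paper.

```python
def lambda_schedule(lambda0, beta, num_iters):
    """Generate the stateful schedule lambda_{k+1} = (lambda_k) ** beta.

    With 0 < beta < 1, repeated exponentiation damps the exponent, so the
    sequence drifts monotonically toward 1 from whichever side lambda0 starts on.
    """
    lam, schedule = lambda0, []
    for _ in range(num_iters):
        schedule.append(lam)
        lam = lam ** beta
    return schedule

# Placeholder values, for illustration only.
print(lambda_schedule(lambda0=0.5, beta=0.7, num_iters=5))
# approximately [0.5, 0.616, 0.712, 0.788, 0.847] -> approaches 1
```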