Accelerating Safe Reinforcement Learning with Constraint-mismatched Baseline Policies
Authors: Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, Peter J Ramadge
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments on five different control tasks, our algorithm consistently outperforms several state-of-the-art baselines, achieving 10 times fewer constraint violations and 40% higher reward on average. |
| Researcher Affiliation | Collaboration | Princeton University; Siemens Corporation, Corporate Technology. |
| Pseudocode | Yes | Algorithm 1 SPACE |
| Open Source Code | Yes | Code is available at: https://sites.google.com/view/spacealgo |
| Open Datasets | No | The paper mentions tasks such as MuJoCo, real-world traffic management, and car-racing, and uses human demonstration data, but does not provide concrete access information (links, DOIs, formal citations) for any public datasets used for training. |
| Dataset Splits | No | The paper describes experimental tasks but does not provide specific details on dataset splits (e.g., percentages, sample counts, or explicit validation sets). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'neural networks to represent Gaussian policies' and different projection types but does not specify versions for any software dependencies or libraries. |
| Experiment Setup | Yes | For all the algorithms, we use neural networks to represent Gaussian policies. We use the KL-divergence projection in the MuJoCo and car-racing tasks, and the 2-norm projection in the traffic management task since it achieves better performance. We use a grid search to select the hyper-parameters. See the supplementary material for more experimental details. ... The weight is fixed and it is set to 1. ... The d-CPO update solves the f-CPO problem with a stateful λ_{k+1} = (λ_k)^β, where 0 < β < 1. (The weight schedule is illustrated in the sketch below the table.) |
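The experiment-setup excerpt quotes a stateful weight schedule λ_{k+1} = (λ_k)^β with 0 < β < 1. The snippet below is a minimal, hypothetical Python sketch of that schedule only; the function and variable names (`decay_weight`, `blend_with_baseline`, `lam`, `beta`) and the interpretation of the weight as an interpolation coefficient between new and baseline policy parameters are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of the stateful weight schedule lambda_{k+1} = (lambda_k)**beta
# quoted in the Experiment Setup row above. All names and the interpolation example
# are illustrative assumptions, not taken from the SPACE code release.
import numpy as np


def decay_weight(lam: float, beta: float) -> float:
    """One step of the stateful update lambda_{k+1} = (lambda_k)**beta, with 0 < beta < 1."""
    assert 0.0 < beta < 1.0, "beta must lie in (0, 1)"
    return lam ** beta


def blend_with_baseline(theta_new: np.ndarray,
                        theta_baseline: np.ndarray,
                        lam: float) -> np.ndarray:
    """Assumed illustration: (1 - lam) is the weight still placed on the baseline
    parameters, so the baseline's influence shrinks as lam grows toward 1."""
    return lam * theta_new + (1.0 - lam) * theta_baseline


if __name__ == "__main__":
    lam, beta = 0.5, 0.9          # hypothetical starting weight and decay exponent
    theta_new = np.ones(4)        # stand-in for freshly updated policy parameters
    theta_base = np.zeros(4)      # stand-in for the baseline policy's parameters
    for k in range(10):
        theta = blend_with_baseline(theta_new, theta_base, lam)
        print(f"iter {k}: lambda = {lam:.3f}, blended theta[0] = {theta[0]:.3f}")
        lam = decay_weight(lam, beta)  # for lam in (0, 1) and beta in (0, 1), repeated
                                       # exponentiation pushes lam monotonically toward 1
```

Under these assumptions, the weight on the baseline policy decays geometrically in log-space across iterations, which matches the excerpt's contrast between a fixed weight (set to 1) and the stateful d-CPO update.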