Conservative Safety Critics for Exploration

Authors: Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, Animesh Garg

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at this url https://sites.google.com/view/conservative-safety-critics/
Researcher Affiliation | Academia | 1 University of Toronto, Vector Institute; 2 University of California, Berkeley
Pseudocode | Yes | Algorithm 1 CSC: safe exploration with conservative safety critics
Open Source Code | No | The paper mentions a URL for videos (https://sites.google.com/view/conservative-safety-critics/) but does not provide a link to its own source code or explicitly state that its code is open-source.
Open Datasets | No | The paper describes custom simulated environments (Point agent, Car, Panda push, Laikago) built on frameworks like Robosuite and PyBullet, and mentions seeding a replay buffer with 1000 tuples, but it does not provide concrete access information (link, DOI, or formal citation for a specific public dataset used directly) for any dataset.
Dataset Splits | No | The paper discusses collecting on-policy samples and using a replay buffer for training but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits).
Hardware Specification | No | The paper thanks 'Vector Institute, Toronto and the Department of Computer Science, University of Toronto for compute support,' but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments.
Software Dependencies | No | The paper mentions using Robosuite, PyBullet, and TensorFlow but does not provide specific version numbers for these or other ancillary software components.
Experiment Setup | Yes | We chose the learning rate η_Q for the safety critic Q_C to be 2e-4 after experimenting with 1e-4 and 2e-4 and observing slightly better results with the latter. The discount factor γ is set to the usual default of 0.99, the learning rate η_λ of the dual variable λ is set to 4e-2, the value of δ for the D_KL constraint on policy updates is set to 0.01, and the value of α to 0.5. We experimented with three different α values (0.05, 0.5, 5) and found nearly the same performance across all three. For policy updates, the backtracking coefficient β^(0) is set to 0.7 and the maximum number of line-search iterations L = 20. For the Q-ensembles baseline, the ensemble size is chosen to be 20 (as mentioned in the LNT paper), with the rest of the common hyperparameter values consistent with CSC for a fair comparison. All results are over four random seeds.
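
To make the Experiment Setup row easier to reuse, here is a minimal Python sketch that collects the reported hyperparameters into one config and illustrates how a backtracking line search uses the coefficient β^(0) = 0.7 and the cap of L = 20 iterations. The dictionary keys, the function name backtracking_line_search, and the accept_fn acceptance callback are illustrative assumptions rather than the authors' implementation; only the numeric values come from the paper as quoted above.

```python
# Hypothetical config collecting the hyperparameters quoted in the row above.
CSC_HPARAMS = {
    "lr_safety_critic": 2e-4,      # eta_Q for Q_C (1e-4 was also tried)
    "discount": 0.99,              # gamma
    "lr_dual": 4e-2,               # eta_lambda for the dual variable lambda
    "kl_limit": 0.01,              # delta for the D_KL constraint on policy updates
    "cql_alpha": 0.5,              # alpha (0.05, 0.5, and 5 performed similarly)
    "backtrack_coeff": 0.7,        # beta^(0)
    "max_line_search_iters": 20,   # L
    "q_ensemble_size": 20,         # Q-ensembles baseline only (LNT setting)
    "num_seeds": 4,
}


def backtracking_line_search(theta, full_step, accept_fn,
                             beta=CSC_HPARAMS["backtrack_coeff"],
                             max_iters=CSC_HPARAMS["max_line_search_iters"]):
    """Generic TRPO/CPO-style backtracking: shrink the proposed step by `beta`
    each iteration until `accept_fn` (e.g. "surrogate improves and KL <= delta")
    is satisfied, giving up after `max_iters` attempts."""
    for i in range(max_iters):
        candidate = theta + (beta ** i) * full_step
        if accept_fn(candidate):
            return candidate
    return theta  # no acceptable step found; keep the current parameters


if __name__ == "__main__":
    # Toy 1-D check: accept only steps that stay inside a small trust region.
    print(backtracking_line_search(0.0, 1.0, accept_fn=lambda t: abs(t) <= 0.25))
```

In an actual constrained policy update, accept_fn would evaluate the surrogate objective and the KL divergence against δ = 0.01; the toy call above only shows the shrinking-step mechanics that β^(0) and L control.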