Conservative Safety Critics for Exploration
Authors: Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, Animesh Garg
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at this URL: https://sites.google.com/view/conservative-safety-critics/ |
| Researcher Affiliation | Academia | ¹University of Toronto, Vector Institute; ²University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1 CSC: safe exploration with conservative safety critics (a hedged sketch of this update appears after the table) |
| Open Source Code | No | The paper mentions a URL for videos (https://sites.google.com/view/conservative-safety-critics/) but does not provide a link to its own source code or explicitly state that its code is open-source. |
| Open Datasets | No | The paper describes custom simulated environments (Point agent, Car, Panda push, Laikago) built on frameworks like Robosuite and PyBullet, and mentions seeding a replay buffer with 1000 tuples, but it does not provide concrete access information (link, DOI, formal citation for a specific public dataset used directly) for any dataset. |
| Dataset Splits | No | The paper discusses collecting on-policy samples and using a replay buffer for training but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | No | The paper thanks 'Vector Institute, Toronto and the Department of Computer Science, University of Toronto for compute support,' but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper mentions using Robosuite, PyBullet, and TensorFlow but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | We chose the learning rate η_Q for the safety critic Q_C to be 2e-4 after experimenting with 1e-4 and 2e-4 and observing slightly better results with the latter. The discount factor γ is set to the usual default value of 0.99, the learning rate η_λ of the dual variable λ is set to 4e-2, the value of δ for the D_KL constraint on policy updates is set to 0.01, and the value of α to 0.5. We experimented with three different α values (0.05, 0.5, 5) and found nearly the same performance across all three. For policy updates, the backtracking coefficient β^(0) is set to 0.7 and the maximum number of line-search iterations is L = 20. For the Q-ensembles baseline, the ensemble size is chosen to be 20 (as mentioned in the LNT paper), with the rest of the common hyperparameter values consistent with CSC for a fair comparison. All results are over four random seeds. (These values are collected in the config sketch after the table.) |
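
For reference, the hyperparameters reported in the Experiment Setup row can be gathered into a single configuration. The dict below is a hypothetical grouping (the key names are ours; the values are the paper's reported settings):

```python
# Hypothetical config collecting the hyperparameters reported in the paper.
# Key names are illustrative; values come from the Experiment Setup row above.
CSC_CONFIG = {
    "eta_Q": 2e-4,          # safety-critic learning rate (1e-4 also tried)
    "gamma": 0.99,          # discount factor
    "eta_lambda": 4e-2,     # learning rate of the dual variable lambda
    "delta": 0.01,          # D_KL constraint on policy updates
    "alpha": 0.5,           # conservatism coefficient (0.05, 0.5, 5 performed similarly)
    "beta_0": 0.7,          # initial backtracking line-search coefficient
    "max_line_search": 20,  # maximum number of line-search iterations L
    "num_seeds": 4,         # all results averaged over four random seeds
    "ensemble_size": 20,    # Q-ensembles baseline only (per the LNT paper)
}
```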
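Since the paper's own implementation is not released, the following is a minimal PyTorch sketch, not the authors' code, of what the conservative safety-critic update at the core of Algorithm 1 could look like. The class and function names (`SafetyCritic`, `conservative_safety_loss`) are hypothetical; the sketch assumes a CQL-style regularizer with the sign flipped relative to reward-value CQL, so that the critic overestimates the probability of catastrophic failure on out-of-distribution actions, consistent with the paper's stated goal.

```python
import torch
import torch.nn as nn

class SafetyCritic(nn.Module):
    """Q_C(s, a): estimated probability of catastrophic failure (sketch)."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # output in [0, 1]
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def conservative_safety_loss(critic, obs, act, cost, next_obs, next_act,
                             policy_act, alpha=0.5, gamma=0.99):
    # Bellman backup on the observed failure indicator `cost` (1 = failure).
    with torch.no_grad():
        target = cost + gamma * (1.0 - cost) * critic(next_obs, next_act)
    bellman = ((critic(obs, act) - target) ** 2).mean()
    # Conservative term (assumed sign, flipped from reward-CQL): pushes Q_C
    # up on current-policy actions and down on replay actions, so failure
    # probability is overestimated away from the data.
    conservative = critic(obs, act).mean() - critic(obs, policy_act).mean()
    return bellman + alpha * conservative

# Toy usage with random data, using the reported eta_Q and alpha:
obs_dim, act_dim, batch = 4, 2, 32
critic = SafetyCritic(obs_dim, act_dim)
opt = torch.optim.Adam(critic.parameters(), lr=2e-4)
obs, next_obs = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)
act, next_act = torch.randn(batch, act_dim), torch.randn(batch, act_dim)
policy_act = torch.randn(batch, act_dim)
cost = torch.randint(0, 2, (batch,)).float()
loss = conservative_safety_loss(critic, obs, act, cost,
                                next_obs, next_act, policy_act)
opt.zero_grad(); loss.backward(); opt.step()
```

The full method additionally constrains policy updates via the dual variable λ and a backtracking line search (β^(0) = 0.7, L = 20, per the table above); those pieces are omitted here for brevity.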