Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Conservative Safety Critics for Exploration
Authors: Homanga Bharadhwaj, Aviral Kumar, Nicholas Rhinehart, Sergey Levine, Florian Shkurti, Animesh Garg
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that the proposed approach can achieve competitive task performance while incurring significantly lower catastrophic failure rates during training than prior methods. Videos are at this url https://sites.google.com/view/conservative-safety-critics/ |
| Researcher Affiliation | Academia | ¹University of Toronto, Vector Institute; ²University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1 CSC: safe exploration with conservative safety critics |
| Open Source Code | No | The paper mentions a URL for videos (https://sites.google.com/view/conservative-safety-critics/) but does not provide a link to its own source code or explicitly state that its code is open-source. |
| Open Datasets | No | The paper describes custom simulated environments (Point agent, Car, Panda push, Laikago) built on frameworks like Robosuite and PyBullet, and mentions seeding a replay buffer with 1000 tuples, but it does not provide concrete access information (link, DOI, formal citation for a specific public dataset used directly) for any dataset. |
| Dataset Splits | No | The paper discusses collecting on-policy samples and using a replay buffer for training but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits). |
| Hardware Specification | No | The paper thanks 'Vector Institute, Toronto and the Department of Computer Science, University of Toronto for compute support,' but does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running experiments. |
| Software Dependencies | No | The paper mentions using Robosuite, PyBullet, and TensorFlow but does not provide specific version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | We chose the learning rate η_Q for the safety-critic Q_C to be 2e-4 after experimenting with 1e-4 and 2e-4 and observing slightly better results with the latter. The value of discount factor γ is set to the usual default value 0.99, the learning rate η_λ of the dual variable λ is set to 4e-2, the value of δ for the D_KL constraint on policy updates is set to 0.01, and the value of α to be 0.5. We experimented with three different α values 0.05, 0.5, 5 and found nearly same performance across these three values. For policy updates, the backtracking co-efficient β(0) is set to 0.7 and the max. number of line search iterations L = 20. For the Q-ensembles baseline, the ensemble size is chosen to be 20 (as mentioned in the LNT paper), with the rest of the common hyper-parameter values consistent with CSC, for a fair comparison. All results are over four random seeds. |
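
For readers attempting a reimplementation, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration object. The sketch below is only an illustration: the class name `CSCHyperparameters` and all field names are hypothetical and do not come from the paper, which does not release code. The exponents assume the learning rates read 2e-4 and 4e-2.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CSCHyperparameters:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""

    safety_critic_lr: float = 2e-4   # eta_Q, learning rate of the safety critic Q_C
    discount: float = 0.99           # gamma, discount factor
    dual_lr: float = 4e-2            # eta_lambda, learning rate of the dual variable lambda
    kl_delta: float = 0.01           # delta, bound in the D_KL constraint on policy updates
    alpha: float = 0.5               # alpha; 0.05, 0.5 and 5 reportedly performed about the same
    backtrack_coeff: float = 0.7     # beta(0), backtracking coefficient for policy updates
    max_line_search_iters: int = 20  # L, maximum number of line search iterations
    q_ensemble_size: int = 20        # used only by the Q-ensembles baseline
    num_seeds: int = 4               # all results are reported over four random seeds


config = CSCHyperparameters()
print(asdict(config))
```

Freezing the dataclass keeps the reported values from being changed by accident during a run. The row does not report network architectures, batch sizes, or software versions, so those are left out here.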