Guiding Safe Exploration with Weakest Preconditions

Authors: Greg Anderson, Swarat Chaudhuri, Isil Dillig

ICLR 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the approach on a suite of continuous control benchmarks and show that it can achieve comparable performance to existing safe learning techniques while incurring fewer safety violations. |
| Researcher Affiliation | Academia | Greg Anderson, Swarat Chaudhuri, Isil Dillig; Department of Computer Science, The University of Texas at Austin, Austin, TX, USA; {ganderso, swarat, isil}@cs.utexas.edu |
| Pseudocode | Yes | Algorithm 1: The main learning algorithm |
| Open Source Code | Yes | SPICE is available at https://github.com/gavlegoat/spice. |
| Open Datasets | Yes | We test SPICE using the benchmarks considered in Anderson et al. (2020). [...] Our experiments are taken from Anderson et al. (2020), and consist of 10 environments with continuous state and action spaces. |
| Dataset Splits | No | The paper describes how data is gathered during the reinforcement learning process (e.g., 'We gather real data for 10 episodes for each model update then collect data from 70 simulated episodes'), but it does not specify explicit train/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper states that 'Compute resources for the experiments were provided by the Texas Advanced Computing Center' but does not specify particular GPU or CPU models or other hardware components. |
| Software Dependencies | No | The paper mentions software components such as PyEarth (Rudy, 2013), CVXOPT (Anderson et al., 2022), MBPO (Janner et al., 2019), and Soft Actor-Critic (Haarnoja et al., 2018a), but does not provide specific version numbers for these software packages. |
| Experiment Setup | Yes | Further details of the benchmarks and hyperparameters are given in Appendix C. [...] We gather real data for 10 episodes for each model update then collect data from 70 simulated episodes before updating the environment model again. We look five time steps into the future during safety analysis. Our SAC implementation (adapted from Tandon (2018)) uses automatic entropy tuning as proposed in Haarnoja et al. (2018b). Each training process is cut off after 48 hours. We train each benchmark starting from nine distinct seeds. |
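The Experiment Setup row quotes a concrete training schedule: 10 real episodes per model update, 70 simulated episodes before the next model update, a 5-step safety-analysis horizon, a 48-hour wall-clock cutoff, and 9 random seeds. The following minimal Python sketch simply encodes those quoted settings as a configuration object and a phase schedule; the class, function, and phase names are hypothetical placeholders for illustration and are not taken from the SPICE repository.

```python
# Hypothetical sketch of the data-gathering schedule quoted in the Experiment Setup row.
# Only the numeric settings come from the paper's quoted text; every identifier below is
# an illustrative placeholder, not the authors' API.

from dataclasses import dataclass


@dataclass
class SetupConfig:
    real_episodes_per_update: int = 10   # real-environment episodes gathered per model update
    sim_episodes_per_update: int = 70    # simulated episodes collected before refitting the model
    safety_horizon: int = 5              # time steps looked ahead during safety analysis
    wall_clock_limit_hours: int = 48     # each training process is cut off after 48 hours
    num_seeds: int = 9                   # each benchmark is trained from nine distinct seeds


def training_schedule(cfg: SetupConfig, num_model_updates: int):
    """Yield (phase, count) pairs describing the alternation of real and simulated data."""
    for _ in range(num_model_updates):
        yield ("collect_real_episodes", cfg.real_episodes_per_update)
        yield ("refit_environment_model", 1)
        yield ("collect_simulated_episodes", cfg.sim_episodes_per_update)
        yield ("update_policy_sac", 1)


if __name__ == "__main__":
    for phase, count in training_schedule(SetupConfig(), num_model_updates=2):
        print(f"{phase}: {count}")
```

This only captures the alternation of real-data collection, model refitting, and simulated rollouts described in the quoted setup; the safety intervention itself (the weakest-precondition analysis over the learned model) is part of Algorithm 1 in the paper and is not reproduced here.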