Safe Model-based Reinforcement Learning with Stability Guarantees

Authors: Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, Andreas Krause

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show how the resulting algorithm can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down.
Researcher Affiliation | Academia | Felix Berkenkamp, Department of Computer Science, ETH Zurich (befelix@inf.ethz.ch); Matteo Turchetta, Department of Computer Science, ETH Zurich (matteotu@inf.ethz.ch); Angela P. Schoellig, Institute for Aerospace Studies, University of Toronto (schoellig@utias.utoronto.ca); Andreas Krause, Department of Computer Science, ETH Zurich (krausea@ethz.ch)
Pseudocode | Yes | Algorithm 1: SAFELYAPUNOVLEARNING
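
For orientation, below is a minimal, hedged Python sketch of the loop that Algorithm 1 (SafeLyapunovLearning) describes: alternately optimize the policy subject to the Lyapunov decrease condition, recompute the safe level set under the current GP model, collect the most uncertain state-action pair inside that set, and update the GP with the observed transition. All object interfaces (`gp`, `policy`, `lyapunov`, `true_step`) are hypothetical placeholders for illustration, not the API of the authors' safe_learning package.

```python
import numpy as np

def safe_lyapunov_learning(gp, policy, lyapunov, true_step, safe_set, n_iters=50):
    """Hedged sketch of the safe learning loop; interfaces are assumed:

    gp        -- model with predict(states, actions) -> (mean, std) and
                 add_data(state, action, next_state)
    policy    -- callable on a state, with an optimize(safe_set) method that
                 enforces the Lyapunov decrease condition on `safe_set`
    lyapunov  -- object with safe_level_set(gp, policy) -> array of safe states
    true_step -- one interaction with the real system: (state, action) -> next state
    """
    for _ in range(n_iters):
        # Improve the policy while keeping the decrease condition on the
        # current estimate of the region of attraction.
        policy.optimize(safe_set)

        # Recompute the safe level set under the updated model and policy.
        safe_set = lyapunov.safe_level_set(gp, policy)

        # Pick the most informative *safe* sample: the state-action pair
        # with the largest predictive uncertainty inside the safe set.
        actions = np.array([policy(x) for x in safe_set])
        _, std = gp.predict(safe_set, actions)
        i = int(np.argmax(np.sum(std, axis=-1)))

        # Apply that action on the real system and update the GP model.
        x_next = true_step(safe_set[i], actions[i])
        gp.add_data(safe_set[i], actions[i], x_next)

    return policy, safe_set
```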
Open Source Code | Yes | A Python implementation of Algorithm 1 and the experiments based on TensorFlow [37] and GPflow [38] is available at https://github.com/befelix/safe_learning.
Open Datasets | No | The paper describes using a 'simulated inverted pendulum benchmark problem' and its dynamics, but does not provide a link, DOI, or formal citation for a publicly available or open dataset.
Dataset Splits | No | The paper describes a simulated environment and does not specify training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper states that experiments were run on a 'simulated inverted pendulum' and mentions using TensorFlow, but does not provide any specific hardware details such as CPU/GPU models, memory, or cloud instance types.
Software Dependencies | No | The paper mentions 'TensorFlow [37] and GPflow [38]' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | For the policy, we use a neural network with two hidden layers and 32 neurons with ReLU activations each. We compute a conservative estimate of the Lipschitz constant as in [30]. We use standard approximate dynamic programming with a quadratic, normalized cost $r(x, u) = x^\top Q x + u^\top R u$, where Q and R are positive-definite, to compute the cost-to-go $J_{\pi_\theta}$. Specifically, we use a piecewise-linear triangulation of the state space to approximate $J_{\pi_\theta}$, see [39]. We optimize the policy via stochastic gradient descent on (7), where we sample a finite subset of X and replace the integral in (7) with a sum. We verify our approach on an inverted pendulum benchmark problem. The true, continuous-time dynamics are given by $m l^2 \ddot{\psi} = g m l \sin(\psi) - \lambda \dot{\psi} + u$, where ψ is the angle, m the mass, g the gravitational constant, and u the torque applied to the pendulum. We use a GP model for the discrete-time dynamics, where the mean dynamics are given by a linearized and discretized model of the true dynamics that considers a wrong, lower mass and neglects friction. We use a combination of linear and Matérn kernels in order to capture the model errors that result from parameter and integration errors. To enable more data-efficient learning, we fix $\beta_n = 2$.
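
To make the quoted setup concrete, here is a minimal, hedged Python sketch of its components: Euler-discretized pendulum dynamics $m l^2 \ddot{\psi} = g m l \sin(\psi) - \lambda \dot{\psi} + u$, a two-hidden-layer policy with 32 ReLU units per layer, the quadratic cost $x^\top Q x + u^\top R u$, and a GP with a combined linear and Matérn kernel. The numerical parameters (mass, length, friction, time step, Q, R), the random untrained weights, and the use of scikit-learn in place of TensorFlow/GPflow are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, Matern

# True continuous-time pendulum dynamics, Euler-discretized.
# Parameter values are illustrative; the excerpt does not list the exact ones.
m, l, g, lam, dt = 0.15, 0.5, 9.81, 0.1, 0.01

def pendulum_step(state, u):
    """One Euler step of m*l^2 * dd_psi = g*m*l*sin(psi) - lam*d_psi + u."""
    psi, dpsi = state
    ddpsi = (g * m * l * np.sin(psi) - lam * dpsi + u) / (m * l ** 2)
    return np.array([psi + dt * dpsi, dpsi + dt * ddpsi])

# Quadratic, normalized cost r(x, u) = x^T Q x + u^T R u with Q, R positive-definite.
Q = np.diag([1.0, 0.1])
R = 0.01

def cost(x, u):
    return float(x @ Q @ x + R * u ** 2)

# Policy: two hidden layers with 32 ReLU units each (a NumPy stand-in for the
# TensorFlow network used in the paper; weights here are random, not trained).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(2, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, 32)), np.zeros(32)
W3, b3 = rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)

def policy(x):
    h = np.maximum(0.0, x @ W1 + b1)
    h = np.maximum(0.0, h @ W2 + b2)
    return float((h @ W3 + b3)[0])

# GP model of the discrete-time dynamics: a combination of a linear (DotProduct)
# and a Matern kernel, sketched with scikit-learn rather than GPflow; the exact
# combination and hyperparameters are assumptions.
kernel = DotProduct() + Matern(nu=2.5, length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-4)

# Example: roll the (untrained) policy for a few steps and fit the GP to the
# observed one-step transitions (state, action) -> next state.
inputs, targets = [], []
x = np.array([0.1, 0.0])
for _ in range(20):
    u = policy(x)
    x_next = pendulum_step(x, u)
    inputs.append(np.concatenate([x, [u]]))
    targets.append(x_next)
    x = x_next
gp.fit(np.array(inputs), np.array(targets))
```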