Safe Model-based Reinforcement Learning with Stability Guarantees
Authors: Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, Andreas Krause
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we show how the resulting algorithm can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down. |
| Researcher Affiliation | Academia | Felix Berkenkamp, Department of Computer Science, ETH Zurich, befelix@inf.ethz.ch; Matteo Turchetta, Department of Computer Science, ETH Zurich, matteotu@inf.ethz.ch; Angela P. Schoellig, Institute for Aerospace Studies, University of Toronto, schoellig@utias.utoronto.ca; Andreas Krause, Department of Computer Science, ETH Zurich, krausea@ethz.ch |
| Pseudocode | Yes | Algorithm 1 SAFELYAPUNOVLEARNING |
| Open Source Code | Yes | A Python implementation of Algorithm 1 and the experiments based on TensorFlow [37] and GPflow [38] is available at https://github.com/befelix/safe_learning. |
| Open Datasets | No | The paper describes using a 'simulated inverted pendulum benchmark problem' and its dynamics, but does not provide a link, DOI, or formal citation for a publicly available or open dataset. |
| Dataset Splits | No | The paper describes a simulated environment and does not specify training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper states that experiments were run on a 'simulated inverted pendulum' and mentions using 'TensorFlow' but does not provide any specific hardware details such as CPU/GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions 'TensorFlow [37] and GPflow [38]' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For the policy, we use a neural network with two hidden layers and 32 neurons with ReLU activations each. We compute a conservative estimate of the Lipschitz constant as in [30]. We use standard approximate dynamic programming with a quadratic, normalized cost r(x, u) = xᵀQx + uᵀRu, where Q and R are positive-definite, to compute the cost-to-go J_πθ. Specifically, we use a piecewise-linear triangulation of the state-space so as to approximate J_πθ, see [39]. We optimize the policy via stochastic gradient descent on (7), where we sample a finite subset of X and replace the integral in (7) with a sum. We verify our approach on an inverted pendulum benchmark problem. The true, continuous-time dynamics are given by ml²ψ̈ = gml sin(ψ) − λψ̇ + u, where ψ is the angle, m the mass, g the gravitational constant, and u the torque applied to the pendulum. We use a GP model for the discrete-time dynamics, where the mean dynamics are given by a linearized and discretized model of the true dynamics that considers a wrong, lower mass and neglects friction. We use a combination of linear and Matérn kernels in order to capture the model errors that result from parameter and integration errors. To enable more data-efficient learning, we fix β_n = 2. |
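
The experiment-setup description above can be illustrated with a few short sketches. First, a minimal Python sketch of the inverted-pendulum dynamics and the quadratic stage cost quoted in the table; the parameter values (mass, length, friction, time step) and the weights Q and R are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the inverted-pendulum benchmark and quadratic cost.
# Parameter values below are illustrative assumptions, not the paper's settings.
import numpy as np

m, l, g, lam = 0.15, 0.5, 9.81, 0.1   # mass, length, gravity, friction (assumed)
dt = 0.01                             # Euler discretization step (assumed)

def pendulum_step(state, u):
    """One Euler step of m*l^2 * psi_ddot = g*m*l*sin(psi) - lam*psi_dot + u."""
    psi, psi_dot = state
    psi_ddot = (g * m * l * np.sin(psi) - lam * psi_dot + u) / (m * l ** 2)
    return np.array([psi + dt * psi_dot, psi_dot + dt * psi_ddot])

Q = np.diag([1.0, 0.1])   # state cost weights (assumed)
R = np.array([[0.01]])    # input cost weight (assumed)

def quadratic_cost(x, u):
    """Quadratic stage cost r(x, u) = x^T Q x + u^T R u."""
    x, u = np.atleast_1d(x), np.atleast_1d(u)
    return float(x @ Q @ x + u @ R @ u)
```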
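The GP dynamics model is described as a combination of linear and Matérn kernels. A minimal sketch with GPflow is shown below; because the paper does not pin a GPflow version (see the Software Dependencies row), this is written against the GPflow 2.x API, and the training data here are placeholders rather than pendulum rollouts.

```python
# Sketch of a GP model with a linear + Matérn kernel combination (GPflow 2.x API assumed).
import numpy as np
import gpflow

# Placeholder data: inputs X = [angle, angular velocity, torque], outputs Y = model error.
X = np.random.randn(20, 3)
Y = np.random.randn(20, 1)

kernel = gpflow.kernels.Linear() + gpflow.kernels.Matern32()
model = gpflow.models.GPR(data=(X, Y), kernel=kernel)

mean, var = model.predict_f(X[:5])  # posterior mean and variance at test points
```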
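Finally, the policy architecture quoted above (two hidden layers of 32 ReLU units each) can be sketched with the TensorFlow 2 Keras API; the original implementation targets an older TensorFlow release, so this reproduces only the architecture, not the authors' training setup.

```python
# Sketch of the policy network: two hidden layers with 32 ReLU units each.
import tensorflow as tf

policy = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),              # state: [angle, angular velocity]
    tf.keras.layers.Dense(32, activation="relu"),   # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),   # hidden layer 2
    tf.keras.layers.Dense(1),                       # torque output
])

torque = policy(tf.zeros((1, 2)))  # evaluate the policy at the upright equilibrium
```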