Safe Exploration and Optimization of Constrained MDPs Using Gaussian Processes
Authors: Akifumi Wachi, Yanan Sui, Yisong Yue, Masahiro Ono
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach on a range of experiments, including a simulation using the real Martian terrain data. We evaluate our approach on two problem settings. One is with randomly generated environments on which we perform Monte-Carlo simulations, and the other is with environments created from real Martian terrain data. |
| Researcher Affiliation | Collaboration | Akifumi Wachi, University of Tokyo, wachi@space.t.u-tokyo.ac.jp; Yanan Sui and Yisong Yue, California Institute of Technology, {ysui, yyue}@caltech.edu; Masahiro Ono, Jet Propulsion Laboratory, California Institute of Technology, ono@jpl.nasa.gov |
| Pseudocode | Yes | Algorithm 1 SAFEEXPOPT-MDP 1: loop 2: Observe reward and safety values of the current state, and update GP models 3: Update $S_t^{\text{safe}}$, $S_t^{\text{uncertain}}$, and $S_t^{\text{unsafe}}$ 4: Compute $\hat{J}_N$ for Optimistic MDP and $J_N$ for Pessimistic MDP by Eq. (5) and Eq. (6) 5: Derive the optimal policy to maximize $\eta \hat{J}_N + (1 - \eta) J_N$ by Eq. (8) 6: Execute the next optimal action 7: end loop |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the methodology described. |
| Open Datasets | Yes | The elevation map is derived from the digital terrain models (DEMs), created from HiRISE camera on the Mars Reconnaissance Orbiter (McEwen et al. 2007). |
| Dataset Splits | No | The paper refers to using datasets for simulations (randomly generated and Martian terrain data) and notes that 'GP hyper-parameters are obtained through trial and error.' However, it does not provide specific dataset split information (e.g., percentages, sample counts, or clear cross-validation setup) needed to reproduce the data partitioning into training, validation, or test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions algorithms and kernels used (e.g., GP with RBF kernel, Matern kernel, BEB algorithm) but does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | For each simulation, GP hyper-parameters are obtained through trial and error. As for the two Mars simulations, we refer to the previous work by (Turchetta, Berkenkamp, and Krause 2016) and use similar parameters. ... The exploring agent predicts the safety function g via GP with a Radial Basis Function (RBF) kernel with the lengthscales being 2.0 and the prior variance of safety being 1.5. ... The discount rate is set to 0.90 and beta is set as $\beta_t = 2, \forall t \geq 0$. The Optimistic and Pessimistic MDPs are solved by the BEB algorithm with the weight coefficient of 2.5 for the exploration bonus. ... The weight coefficient between optimistic and pessimistic policies is set to η = 0.50. |
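
The quoted setup values (RBF lengthscale 2.0, prior variance 1.5, β = 2, η = 0.50) can be illustrated with a minimal Python sketch. It is not the authors' implementation, since the paper releases no code. The function names, the safety threshold argument, and the rule that compares the interval mean ± β·std against that threshold are all assumptions made for illustration; the paper's exact confidence-bound and set-update definitions are not reproduced in the table above.

```python
import math

# Values quoted in the Experiment Setup row.
LENGTHSCALE = 2.0      # RBF lengthscale
PRIOR_VARIANCE = 1.5   # prior variance of the safety function
BETA = 2.0             # confidence-interval scaling
ETA = 0.50             # weight between optimistic and pessimistic values


def rbf_kernel(x, y, lengthscale=LENGTHSCALE, variance=PRIOR_VARIANCE):
    """Standard RBF kernel: variance * exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return variance * math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def classify_state(mean, std, threshold, beta=BETA):
    """Assumed rule: compare the interval mean +/- beta * std with a threshold.

    `mean` and `std` stand for the GP posterior of the safety function at one
    state; `threshold` is a hypothetical safety limit.
    """
    lower = mean - beta * std
    upper = mean + beta * std
    if lower >= threshold:
        return "safe"        # even the pessimistic estimate clears the limit
    if upper < threshold:
        return "unsafe"      # even the optimistic estimate misses the limit
    return "uncertain"


def blended_action(optimistic, pessimistic, eta=ETA):
    """Pick the action maximizing eta * J_optimistic + (1 - eta) * J_pessimistic.

    Both arguments map action names to hypothetical value estimates.
    """
    scores = {a: eta * optimistic[a] + (1.0 - eta) * pessimistic[a]
              for a in optimistic}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(rbf_kernel((0.0, 0.0), (2.0, 0.0)))
    print(classify_state(mean=0.1, std=0.2, threshold=0.0))
    print(blended_action({"left": 4.0, "right": 3.0},
                         {"left": 0.0, "right": 2.0}))
```

With η = 0.50 the optimistic and pessimistic estimates count equally, so an action with a high optimistic value but a poor pessimistic value can lose to a more balanced one, as in the last call.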