Kernel-Based Reinforcement Learning in Robust Markov Decision Processes
Authors: Shiau Hong Lim, Arnaud Autef
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that the better performance bound does translate into solutions that perform better, especially when there is a model mismatch between the training and the testing environments. |
| Researcher Affiliation | Collaboration | 1IBM Research, Singapore 2Applied Mathematics department, Ecole polytechnique, France. Work accomplished while working at IBM Research, Singapore. |
| Pseudocode | Yes | Algorithm 1 Robust kernel-based value iteration... Algorithm 2 Robust kernel-based value iteration, II |
| Open Source Code | Yes | The complete source code for the implementation of our algorithm as well as the task environments are provided in the supplementary material. |
| Open Datasets | No | For Puddle World, ... We follow the strategy of (Barreto et al., 2016) in creating the training set Da by running a random policy on 10 training episodes... The representative states for φ are then created by running K-means on the training states. |
| Dataset Splits | No | The paper mentions 'best-performing training set and kernel parameters are chosen' but does not specify a distinct validation split (e.g., 80/10/10 or similar percentages/counts) for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details (like CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions a 'Gaussian kernel' and a '4th-order Runge–Kutta method' for simulation but does not list specific software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | Our value iteration is stopped when \|\|w_{t+1} - w_t\|\| < 0.001 or after 100 iterations, whichever happens earlier. We use γ = 0.99 for all our tasks. ...For the bandwidth parameters, we employ a wide range during training, from the set {exp(-8), exp(-7), ..., exp(3)}. This results in 144 pairs of (σψ, σφ), and we always choose the best-performing pair based on 30 independent test episodes. |
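The experiment-setup row describes kernel-based value iteration with a Gaussian kernel, γ = 0.99, and a stopping rule of ||w_{t+1} - w_t|| < 0.001 or 100 iterations. The following is a minimal sketch of the *non-robust* variant of that loop, under several assumptions: 1-D states, representative states supplied directly (the paper obtains them via K-means on training states), and no worst-case optimization over an uncertainty set, which is the paper's actual robust contribution. All function names and the toy chain data are illustrative, not from the paper.

```python
import numpy as np

def gaussian_weights(queries, centers, bandwidth):
    # Normalized Gaussian kernel weights from each query point to each center.
    d2 = (queries[:, None] - centers[None, :]) ** 2
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return w / w.sum(axis=1, keepdims=True)

def kernel_value_iteration(samples, reps, bandwidth,
                           gamma=0.99, tol=1e-3, max_iters=100):
    """Non-robust kernel-based value iteration sketch.

    samples: dict action -> (states, next_states, rewards), 1-D arrays.
    reps: representative states (the paper builds these with K-means).
    Returns the value vector w over the representative states.
    """
    w = np.zeros(len(reps))
    # Kernel weights from representative states onto each action's samples,
    # and from each sample's successor back onto the representative states.
    to_samples = {a: gaussian_weights(reps, s, bandwidth)
                  for a, (s, _, _) in samples.items()}
    to_reps = {a: gaussian_weights(ns, reps, bandwidth)
               for a, (_, ns, _) in samples.items()}
    for _ in range(max_iters):
        q = []
        for a, (_, _, r) in samples.items():
            backup = r + gamma * to_reps[a] @ w   # Bellman backup per sample
            q.append(to_samples[a] @ backup)      # kernel-average onto reps
        w_next = np.max(q, axis=0)                # greedy over actions
        if np.max(np.abs(w_next - w)) < tol:      # ||w_{t+1} - w_t|| < 0.001
            return w_next
        w = w_next
    return w

# Toy 1-D chain: action 1 drifts right and is rewarded, action 0 drifts left.
rng = np.random.default_rng(0)
s = rng.uniform(0.0, 1.0, 50)
samples = {
    0: (s, np.clip(s - 0.1, 0.0, 1.0), np.zeros(50)),
    1: (s, np.clip(s + 0.1, 0.0, 1.0), np.clip(s + 0.1, 0.0, 1.0)),
}
reps = np.linspace(0.0, 1.0, 10)
w = kernel_value_iteration(samples, reps, bandwidth=0.1)
```

The robust version in the paper replaces the plain Bellman backup with a worst-case backup over an uncertainty set of transition models; this sketch only illustrates the kernel-averaging structure and the quoted stopping criterion.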