Kernel-Based Reinforcement Learning in Robust Markov Decision Processes

Authors: Shiau Hong Lim, Arnaud Autef

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirically, we demonstrate that the better performance bound does translate into solutions that perform better, especially when there is a model mismatch between the training and the testing environments." |
| Researcher Affiliation | Collaboration | "¹IBM Research, Singapore. ²Applied Mathematics Department, École Polytechnique, France. Work accomplished while working at IBM Research, Singapore." |
| Pseudocode | Yes | "Algorithm 1: Robust kernel-based value iteration ... Algorithm 2: Robust kernel-based value iteration, II" (a hedged value-iteration sketch follows the table) |
| Open Source Code | Yes | "The complete source code for the implementation of our algorithm as well as the task environments are provided in the supplementary material." |
| Open Datasets | No | "For Puddle World, ... We follow the strategy of (Barreto et al., 2016) in creating the training set Da by running a random policy on 10 training episodes ... The representative states for φ are then created by running K-means on the training states." (a data-collection sketch follows the table) |
| Dataset Splits | No | The paper mentions that the "best-performing training set and kernel parameters are chosen" but does not specify a distinct validation split (e.g., 80/10/10 or similar percentages/counts) for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions a "Gaussian kernel" and a "4th-order Runge-Kutta method" for simulation but does not list specific software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x). (a kernel/RK4 sketch follows the table) |
| Experiment Setup | Yes | "Our value iteration is stopped when ‖w_{t+1} − w_t‖ < 0.001 or after 100 iterations, whichever happens earlier. We use γ = 0.99 for all our tasks. ... For the bandwidth parameters, we employ a wide range during training, from the set {exp(−8), exp(−7), ..., exp(3)}. This results in 144 pairs of (σ_ψ, σ_φ), and we always choose the best-performing pair based on 30 independent test episodes." (a grid-search sketch follows the table) |
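
The paper's Algorithms 1 and 2 are given as pseudocode in the PDF; as a reading aid, here is a minimal Python sketch of a robust kernel-based value-iteration loop consistent with the quoted settings (γ = 0.99, stop when the value change falls below 0.001 or after 100 iterations). The ε-pessimistic backup, the uncertainty model, and all identifiers (`robust_kernel_value_iteration`, `trans`, `eps`) are illustrative assumptions, not the authors' exact robust Bellman operator.

```python
import numpy as np

def robust_kernel_value_iteration(trans, rewards, gamma=0.99, eps=0.1,
                                  tol=1e-3, max_iter=100):
    """Value iteration on a kernel-smoothed finite model with an
    eps-pessimistic robust backup (assumed uncertainty model).

    trans:   (A, N, N) array; trans[a, i] is a kernel-weight
             distribution over next representative states.
    rewards: (A, N) array of expected immediate rewards.
    """
    A, N, _ = trans.shape
    v = np.zeros(N)
    for _ in range(max_iter):
        # Nominal kernel-averaged backup for each action.
        nominal = rewards + gamma * (trans @ v)              # (A, N)
        # Pessimistic backup: worst next-state value reachable with
        # nonzero kernel weight under each action.
        worst = np.where(trans > 0, v[None, None, :], np.inf).min(axis=2)
        robust = (1 - eps) * nominal + eps * (rewards + gamma * worst)
        v_new = robust.max(axis=0)                           # greedy over actions
        if np.max(np.abs(v_new - v)) < tol:                  # stop: ||v' - v|| < tol
            v = v_new
            break
        v = v_new
    return v
```

Note that the paper states its stopping criterion on the weight vector w rather than on the value estimates; the sketch applies the same threshold to the values, which play the analogous role here.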
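No dataset is released, but the quoted generation strategy (a random policy run for 10 training episodes, then K-means over the visited states) is straightforward to approximate. Only the 10 episodes and the K-means step come from the paper; the Gym-style `env` interface, `max_steps`, and `n_centers` below are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def collect_random_transitions(env, n_episodes=10, max_steps=200, seed=0):
    """Record (s, a, r, s') tuples from a uniform-random policy.
    Assumes a classic Gym-style env with reset()/step()."""
    rng = np.random.default_rng(seed)
    transitions = []
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = int(rng.integers(env.action_space.n))
            s_next, r, done, _ = env.step(a)
            transitions.append((s, a, r, s_next))
            if done:
                break
            s = s_next
    return transitions

def representative_states(transitions, n_centers=100, seed=0):
    """Cluster the visited states with K-means; the cluster centers act
    as the representative states for the phi kernel."""
    states = np.array([t[0] for t in transitions])
    return KMeans(n_clusters=n_centers, n_init=10,
                  random_state=seed).fit(states).cluster_centers_
```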
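The two numerical components the paper does name, a Gaussian kernel and a 4th-order Runge-Kutta integrator for simulating the dynamics, are standard. A sketch of each follows, using one common parameterization of the kernel (the exact normalization used in the paper is not quoted):

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Normalized Gaussian kernel weights of a query point x over a set
    of centers; sigma is a bandwidth such as sigma_psi or sigma_phi."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()

def rk4_step(f, t, y, h):
    """One classical 4th-order Runge-Kutta step for dy/dt = f(t, y),
    e.g. to advance continuous-state dynamics between decisions."""
    k1 = f(t, y)
    k2 = f(t + h / 2.0, y + h / 2.0 * k1)
    k3 = f(t + h / 2.0, y + h / 2.0 * k2)
    k4 = f(t + h, y + h * k3)
    return y + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```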
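Finally, the quoted bandwidth grid and selection protocol translate directly into code. `evaluate` is a hypothetical placeholder for training with a given bandwidth pair and averaging returns over 30 independent test episodes; everything else follows the quote.

```python
import numpy as np
from itertools import product

def evaluate(sigma_psi, sigma_phi, n_test_episodes=30):
    """Hypothetical stand-in: train with these bandwidths and return the
    mean return over n_test_episodes test episodes (dummy value here)."""
    return 0.0  # replace with the real train/test pipeline

# Grid quoted in the paper: {exp(-8), exp(-7), ..., exp(3)} --
# 12 bandwidths per kernel, hence 12 x 12 = 144 pairs.
bandwidths = np.exp(np.arange(-8, 4))
pairs = list(product(bandwidths, repeat=2))
assert len(pairs) == 144

best_pair, best_score = None, -np.inf
for sigma_psi, sigma_phi in pairs:
    score = evaluate(sigma_psi, sigma_phi)
    if score > best_score:
        best_pair, best_score = (sigma_psi, sigma_phi), score
```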