Kernel-Based Reinforcement Learning in Robust Markov Decision Processes
Authors: Shiau Hong Lim, Arnaud Autef
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate that the better performance bound does translate into solutions that perform better, especially when there is a model mismatch between the training and the testing environments. |
| Researcher Affiliation | Collaboration | 1IBM Research, Singapore 2Applied Mathematics department, Ecole polytechnique, France. Work accomplished while working at IBM Research, Singapore. |
| Pseudocode | Yes | Algorithm 1 Robust kernel-based value iteration... Algorithm 2 Robust kernel-based value iteration, II |
| Open Source Code | Yes | The complete source code for the implementation of our algorithm as well as the task environments are provided in the supplementary material. |
| Open Datasets | No | For Puddle World, ... We follow the strategy of (Barreto et al., 2016) in creating the training set Da by running a random policy on 10 training episodes... The representative states for φ are then created by running K-means on the training states. |
| Dataset Splits | No | The paper mentions 'best-performing training set and kernel parameters are chosen' but does not specify a distinct validation split (e.g., 80/10/10 or similar percentages/counts) for hyperparameter tuning or early stopping. |
| Hardware Specification | No | The paper does not provide specific hardware details (like CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions a 'Gaussian kernel' and a '4th-order Runge–Kutta method' for simulation but does not list specific software dependencies with version numbers (e.g., Python 3.x, TensorFlow 2.x, PyTorch 1.x). |
| Experiment Setup | Yes | Our value iteration is stopped when \|\|w_{t+1} - w_t\|\| < 0.001 or after 100 iterations, whichever happens earlier. We use γ = 0.99 for all our tasks. ...For the bandwidth parameters, we employ a wide range during training, from the set {exp(-8), exp(-7), ..., exp(3)}. This results in 144 pairs of (σψ, σφ), and we always choose the best-performing pair based on 30 independent test episodes. |
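The experiment-setup row describes kernel-based value iteration with a Gaussian kernel, γ = 0.99, and a stopping rule of ||w_{t+1} - w_t|| < 0.001 or 100 iterations. The following is a minimal sketch of the *non-robust* variant of that loop, under several assumptions: 1-D states, representative states supplied directly (the paper obtains them via K-means on training states), and no worst-case optimization over an uncertainty set, which is the paper's actual robust contribution. All function names and the toy chain data are illustrative, not from the paper.

```python
import numpy as np

def gaussian_weights(queries, centers, bandwidth):
    # Normalized Gaussian kernel weights from each query point to each center.
    d2 = (queries[:, None] - centers[None, :]) ** 2
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return w / w.sum(axis=1, keepdims=True)

def kernel_value_iteration(samples, reps, bandwidth,
                           gamma=0.99, tol=1e-3, max_iters=100):
    """Non-robust kernel-based value iteration sketch.

    samples: dict action -> (states, next_states, rewards), 1-D arrays.
    reps: representative states (the paper builds these with K-means).
    Returns the value vector w over the representative states.
    """
    w = np.zeros(len(reps))
    # Kernel weights from representative states onto each action's samples,
    # and from each sample's successor back onto the representative states.
    to_samples = {a: gaussian_weights(reps, s, bandwidth)
                  for a, (s, _, _) in samples.items()}
    to_reps = {a: gaussian_weights(ns, reps, bandwidth)
               for a, (_, ns, _) in samples.items()}
    for _ in range(max_iters):
        q = []
        for a, (_, _, r) in samples.items():
            backup = r + gamma * to_reps[a] @ w   # Bellman backup per sample
            q.append(to_samples[a] @ backup)      # kernel-average onto reps
        w_next = np.max(q, axis=0)                # greedy over actions
        if np.max(np.abs(w_next - w)) < tol:      # ||w_{t+1} - w_t|| < 0.001
            return w_next
        w = w_next
    return w

# Toy 1-D chain: action 1 drifts right and is rewarded, action 0 drifts left.
rng = np.random.default_rng(0)
s = rng.uniform(0.0, 1.0, 50)
samples = {
    0: (s, np.clip(s - 0.1, 0.0, 1.0), np.zeros(50)),
    1: (s, np.clip(s + 0.1, 0.0, 1.0), np.clip(s + 0.1, 0.0, 1.0)),
}
reps = np.linspace(0.0, 1.0, 10)
w = kernel_value_iteration(samples, reps, bandwidth=0.1)
```

The robust version in the paper replaces the plain Bellman backup with a worst-case backup over an uncertainty set of transition models; this sketch only illustrates the kernel-averaging structure and the quoted stopping criterion.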