Bounded Optimal Exploration in MDP
Authors: Kenji Kawaguchi
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose simple algorithms for discrete and continuous state spaces, and illustrate the benefits of our proposed relaxation via theoretical analyses and numerical examples. Our algorithms also maintain anytime error bounds and average loss bounds. Our approach accommodates both Bayesian and non-Bayesian methods. The paper includes sections like 'Experimental Example' and figures showing numerical results (Figures 1, 2, 3), confirming empirical evaluation. |
| Researcher Affiliation | Academia | Kenji Kawaguchi Massachusetts Institute of Technology Cambridge, MA, 02139 kawaguch@mit.edu |
| Pseudocode | Yes | Algorithm 1 Discrete PAC-RMDP and Algorithm 2 Linear PAC-RMDP. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit code release statement) for the source code of the methodology described. |
| Open Datasets | Yes | We consider a five-state chain problem (Strens 2000), which is a standard toy problem in the literature. We consider two examples: the mountain car problem (Sutton and Barto 1998), which is a standard toy problem in the literature, and the HIV problem (Ernst et al. 2006), which originates from a real-world problem. |
| Dataset Splits | No | The paper mentions the number of runs or episodes for the experiments (e.g., 'average over 1000 runs', '100 episodes'), but it does not provide specific training/validation/test dataset split information. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We use a discount factor of γ = 0.95 and a convergence criterion for the value iteration of ϵ = 0.01. We used δ = 0.9 for the PAC-MDP and PAC-RMDP algorithms. The ϵ-greedy algorithm is executed with ϵ = 0.1. In the planning phase, L is estimated as L̂ ← max_{s,s'∈Ω} \|V^A(s) - V^A(s')\| / ‖s - s'‖, where Ω is the set of states that are visited in the planning phase (i.e., fitted value iteration and a greedy roll-out method). |
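
The Lipschitz-constant estimate quoted in the Experiment Setup row can be computed directly from the visited states Ω and their value estimates V^A. The sketch below illustrates that computation only; the paper releases no code, so the function name, the two-dimensional state representation, and the placeholder value estimates are assumptions made purely for illustration.

```python
# Minimal sketch (assumed, not the authors' code) of the quoted estimate
#   L-hat = max_{s, s' in Omega} |V^A(s) - V^A(s')| / ||s - s'||,
# where Omega is the set of states visited during planning.
import itertools
import numpy as np


def estimate_lipschitz_constant(states, values):
    """Estimate L-hat from visited states and their value estimates.

    states: array of shape (n, d), the visited states Omega.
    values: array of shape (n,), the value estimates V^A(s) for each state.
    """
    l_hat = 0.0
    for i, j in itertools.combinations(range(len(states)), 2):
        dist = np.linalg.norm(states[i] - states[j])
        if dist > 0:  # skip duplicate states to avoid division by zero
            l_hat = max(l_hat, abs(values[i] - values[j]) / dist)
    return l_hat


if __name__ == "__main__":
    # Hypothetical mountain-car-like states (position, velocity) and
    # placeholder value estimates, used only to exercise the function.
    rng = np.random.default_rng(0)
    omega = rng.uniform(low=[-1.2, -0.07], high=[0.6, 0.07], size=(50, 2))
    v_a = -10.0 * np.abs(omega[:, 0] - 0.5)
    print("estimated L:", estimate_lipschitz_constant(omega, v_a))
```

In the setting described by the paper, Ω would be the states encountered during fitted value iteration and the greedy roll-out, with V^A the corresponding value estimates.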