Bounded Optimal Exploration in MDP
Authors: Kenji Kawaguchi
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose simple algorithms for discrete and continuous state spaces, and illustrate the benefits of our proposed relaxation via theoretical analyses and numerical examples. Our algorithms also maintain anytime error bounds and average loss bounds. Our approach accommodates both Bayesian and non-Bayesian methods. The paper includes sections like 'Experimental Example' and figures showing numerical results (Figures 1, 2, 3), confirming empirical evaluation. |
| Researcher Affiliation | Academia | Kenji Kawaguchi Massachusetts Institute of Technology Cambridge, MA, 02139 kawaguch@mit.edu |
| Pseudocode | Yes | Algorithm 1 Discrete PAC-RMDP and Algorithm 2 Linear PAC-RMDP. |
| Open Source Code | No | The paper does not provide any concrete access information (e.g., specific repository link, explicit code release statement) for the source code of the methodology described. |
| Open Datasets | Yes | We consider a five-state chain problem (Strens 2000), which is a standard toy problem in the literature. We consider two examples: the mountain car problem (Sutton and Barto 1998), which is a standard toy problem in the literature, and the HIV problem (Ernst et al. 2006), which originates from a real-world problem. |
| Dataset Splits | No | The paper mentions the number of runs or episodes for the experiments (e.g., 'average over 1000 runs', '100 episodes'), but it does not provide specific training/validation/test dataset split information. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We use a discount factor of γ = 0.95 and a convergence criterion for the value iteration of ϵ = 0.01. We used δ = 0.9 for the PAC-MDP and PAC-RMDP algorithms. The ϵ-greedy algorithm is executed with ϵ = 0.1. In the planning phase, L is estimated as L̂ ← max_{s,s'∈Ω} \|V^A(s) - V^A(s')\| / ‖s - s'‖, where Ω is the set of states that are visited in the planning phase (i.e., fitted value iteration and a greedy roll-out method). |
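
The Lipschitz-constant estimate quoted in the Experiment Setup row can be computed directly from the visited states Ω and their value estimates V^A. The sketch below illustrates that computation only; the paper releases no code, so the function name, the two-dimensional state representation, and the placeholder value estimates are assumptions made purely for illustration.

```python
# Minimal sketch (assumed, not the authors' code) of the quoted estimate
#   L-hat = max_{s, s' in Omega} |V^A(s) - V^A(s')| / ||s - s'||,
# where Omega is the set of states visited during planning.
import itertools
import numpy as np


def estimate_lipschitz_constant(states, values):
    """Estimate L-hat from visited states and their value estimates.

    states: array of shape (n, d), the visited states Omega.
    values: array of shape (n,), the value estimates V^A(s) for each state.
    """
    l_hat = 0.0
    for i, j in itertools.combinations(range(len(states)), 2):
        dist = np.linalg.norm(states[i] - states[j])
        if dist > 0:  # skip duplicate states to avoid division by zero
            l_hat = max(l_hat, abs(values[i] - values[j]) / dist)
    return l_hat


if __name__ == "__main__":
    # Hypothetical mountain-car-like states (position, velocity) and
    # placeholder value estimates, used only to exercise the function.
    rng = np.random.default_rng(0)
    omega = rng.uniform(low=[-1.2, -0.07], high=[0.6, 0.07], size=(50, 2))
    v_a = -10.0 * np.abs(omega[:, 0] - 0.5)
    print("estimated L:", estimate_lipschitz_constant(omega, v_a))
```

In the setting described by the paper, Ω would be the states encountered during fitted value iteration and the greedy roll-out, with V^A the corresponding value estimates.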