Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Convergence of a Q-learning Variant for Continuous States and Actions
Authors: S. W. Carden
JAIR 2014 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a reinforcement learning algorithm for solving infinite horizon Markov Decision Processes under the expected total discounted reward criterion when both the state and action spaces are continuous. This algorithm is based on Watkins' Q-learning, but uses Nadaraya-Watson kernel smoothing to generalize knowledge to unvisited states. As expected, continuity conditions must be imposed on the mean rewards and transition probabilities. Using results from kernel regression theory, this algorithm is proven capable of producing a Q-value function estimate that is uniformly within an arbitrary tolerance of the true Q-value function with probability one. The algorithm is then applied to an example problem to empirically show convergence as well. |
| Researcher Affiliation | Academia | Stephen Carden, Department of Mathematical Sciences, Clemson University |
| Pseudocode | Yes | Algorithm 1 (pseudocode for the theoretical algorithm): Initialize bandwidth h, maximum iterations m, discount factor γ, and exploration parameter ϵ; initialize Q̂_{h,0}(s, a) = 0 for all (s, a); set initial state s₁. For i = 1, …, m: draw r ~ Uniform(0, 1); if r < ϵ, let a_i be a random action, otherwise a_i = arg sup_{a ∈ A} Q̂_{h,i−1}(s_i, a); observe next state u_i and reward r_i; set y_{h,i} := r_i + γ sup_{a ∈ A} Q̂_{h,i−1}(u_i, a); update Q̂_{h,i}(s, a) = [Σ_{j=1}^{i} K_h((s, a) − (s_j, a_j)) y_{h,j}] / [Σ_{j=1}^{i} K_h((s, a) − (s_j, a_j))]; set s_{i+1} = u_i. |
| Open Source Code | Yes | For the full source code, see the online appendix associated with this publication. |
| Open Datasets | Yes | In this section we detail an application to the Mountain Car problem (Moore, 1991). |
| Dataset Splits | No | The paper describes the setup of the Mountain Car problem, which is a simulation environment, but it does not specify any training, validation, or test dataset splits in the traditional sense of partitioning a pre-collected dataset. |
| Hardware Specification | Yes | Implementation was in MATLAB 2012b in Ubuntu 12.04 on hardware with an Intel Xeon 3.47 gigahertz processor and 24 gigabytes of RAM. |
| Software Dependencies | Yes | Implementation was in MATLAB 2012b in Ubuntu 12.04 on hardware with an Intel Xeon 3.47 gigahertz processor and 24 gigabytes of RAM. |
| Experiment Setup | Yes | Parameter values were bandwidth h = 0.2, exploration parameter ϵ = 0.9, discount factor γ = 0.9, and k = 20 successful episodes to initialize with. |
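To make the extracted pseudocode concrete, the update in Algorithm 1 can be sketched as follows. This is a minimal illustration, not the paper's MATLAB implementation: the Gaussian kernel, the finite action grid used to approximate the sup over the continuous action space, and the toy 1-D environment `toy_step` are all assumptions introduced here (the paper's experiment uses the Mountain Car problem).

```python
import numpy as np

def gaussian_kernel(d, h):
    """Gaussian smoothing kernel K_h applied to difference vectors d (shape (n, 2))."""
    return np.exp(-np.sum(d ** 2, axis=-1) / (2.0 * h ** 2))

def kernel_q_learning(step, s0, actions, m=500, h=0.2, eps=0.9, gamma=0.9, seed=0):
    """Sketch of Algorithm 1: Q-learning with a Nadaraya-Watson smoothed Q estimate.

    step(s, a) -> (next_state, reward) simulates one transition;
    `actions` is a finite grid approximating the continuous action space.
    Returns the final Q-value estimator Q̂_{h,m}.
    """
    rng = np.random.default_rng(seed)
    visited = []   # (s_j, a_j) pairs
    targets = []   # y_{h,j} values

    def q_hat(s, a):
        # Nadaraya-Watson estimate over all visited state-action pairs.
        if not visited:
            return 0.0
        w = gaussian_kernel(np.array(visited) - np.array([s, a]), h)
        denom = w.sum()
        return float(w @ np.array(targets) / denom) if denom > 0 else 0.0

    s = s0
    for _ in range(m):
        if rng.random() < eps:                        # explore with probability eps
            a = float(rng.choice(actions))
        else:                                         # greedy w.r.t. current estimate
            a = float(max(actions, key=lambda x: q_hat(s, x)))
        u, r = step(s, a)                             # observe next state and reward
        y = r + gamma * max(q_hat(u, x) for x in actions)   # y_{h,i}
        visited.append((s, a))
        targets.append(y)
        s = u                                         # s_{i+1} = u_i
    return q_hat

# Hypothetical toy environment: state in [-1, 1], reward penalizes distance from 0.
def toy_step(s, a):
    u = float(np.clip(s + 0.1 * a, -1.0, 1.0))
    return u, -s ** 2

q = kernel_q_learning(toy_step, 0.0, np.linspace(-1.0, 1.0, 5), m=200)
print(q(0.0, 0.0))
```

Storing every visited pair makes each estimate cost O(i) at step i, which mirrors the theoretical algorithm's growing sample average; a practical implementation would cap or subsample the history.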