Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Convergence of a Q-learning Variant for Continuous States and Actions

Authors: S. W. Carden

JAIR 2014 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "This paper presents a reinforcement learning algorithm for solving infinite-horizon Markov Decision Processes under the expected total discounted reward criterion when both the state and action spaces are continuous. This algorithm is based on Watkins' Q-learning, but uses Nadaraya-Watson kernel smoothing to generalize knowledge to unvisited states. As expected, continuity conditions must be imposed on the mean rewards and transition probabilities. Using results from kernel regression theory, this algorithm is proven capable of producing a Q-value function estimate that is uniformly within an arbitrary tolerance of the true Q-value function with probability one. The algorithm is then applied to an example problem to empirically show convergence as well."
Researcher Affiliation | Academia | "Stephen Carden EMAIL Department of Mathematical Sciences, Clemson University"
Pseudocode | Yes | Algorithm 1, pseudocode for the theoretical algorithm:

    Initialize h = bandwidth value, m = maximum iterations, γ = discount factor, ϵ = exploration parameter
    Initialize Q̂_{h,0}(s, a) = 0 for all (s, a)
    Set initial state s_1
    for i = 1 : m do
        r = Uniform(0, 1) random value
        if r < ϵ then
            a_i = random action
        else
            a_i = arg sup_{a ∈ A} Q̂_{h,i−1}(s_i, a)
        end if
        u_i = next state, r_i = reward
        y_{h,i} := r_i + γ sup_{a ∈ A} Q̂_{h,i−1}(u_i, a)
        Q̂_{h,i}(s, a) = [ Σ_{j=1}^{i} K_h((s, a) − (s_j, a_j)) y_{h,j} ] / [ Σ_{j=1}^{i} K_h((s, a) − (s_j, a_j)) ]
        s_{i+1} = u_i
    end for
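The update in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the paper's MATLAB implementation: the Gaussian kernel, the finite action grid used to approximate the sup over actions, and the class and method names are all assumptions introduced here.

```python
import numpy as np

def gaussian_kernel(d, h):
    # Gaussian smoothing kernel K_h applied to an array of distances d (assumed kernel choice)
    return np.exp(-0.5 * (d / h) ** 2)

class KernelQ:
    """Nadaraya-Watson estimate of the Q-function from visited (s, a) pairs."""

    def __init__(self, h, gamma, actions):
        self.h, self.gamma = h, gamma
        self.actions = actions  # finite grid approximating the continuous action space
        self.points = []        # visited (s_j, a_j) pairs
        self.targets = []       # corresponding y_{h,j} values

    def q(self, s, a):
        # Q-hat_{h,0} = 0 everywhere before any data is observed
        if not self.points:
            return 0.0
        pts = np.array(self.points)
        dists = np.linalg.norm(pts - np.array([s, a]), axis=1)
        w = gaussian_kernel(dists, self.h)
        denom = w.sum()
        # Kernel-weighted average of the stored targets (the Nadaraya-Watson ratio)
        return float(w @ np.array(self.targets) / denom) if denom > 0 else 0.0

    def best_action(self, s):
        # Approximates arg sup_{a in A} Q-hat(s, a) over the finite action grid
        return max(self.actions, key=lambda a: self.q(s, a))

    def update(self, s, a, r, u):
        # y_{h,i} = r_i + gamma * sup_a Q-hat_{h,i-1}(u_i, a), then store the new point
        y = r + self.gamma * self.q(u, self.best_action(u))
        self.points.append([s, a])
        self.targets.append(y)
```

An ϵ-greedy loop would then alternate `best_action` / random actions and call `update` on each observed transition, mirroring the for-loop in Algorithm 1.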
Open Source Code | Yes | "For the full source code, see the online appendix associated with this publication."
Open Datasets | Yes | "In this section we detail an application to the Mountain Car problem (Moore, 1991)."
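For context, the Mountain Car transition dynamics can be sketched as below, using the standard textbook constants (position in [-1.2, 0.6], velocity clipped to ±0.07, force in [-1, 1]); the paper's exact formulation and reward convention may differ, so treat the constants and the reward here as assumptions.

```python
import math

def mountain_car_step(x, v, a, min_x=-1.2, max_x=0.6, max_v=0.07):
    """One transition of Mountain Car with a continuous force a in [-1, 1].

    Standard textbook constants; illustrative only, not the paper's setup.
    """
    v = v + 0.001 * a - 0.0025 * math.cos(3 * x)   # gravity term from the hill shape
    v = max(-max_v, min(max_v, v))                  # clip velocity
    x = x + v
    if x < min_x:                                   # inelastic collision at the left wall
        x, v = min_x, 0.0
    done = x >= max_x                               # goal at the top of the right hill
    reward = 0.0 if done else -1.0                  # -1 per step until the goal (assumed)
    return x, v, reward, done
```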
Dataset Splits | No | The paper describes the setup of the Mountain Car problem, a simulation environment, but does not specify training, validation, or test splits in the traditional sense of partitioning a pre-collected dataset.
Hardware Specification | Yes | "Implementation was in MATLAB 2012b in Ubuntu 12.04 on hardware with an Intel Xeon 3.47 gigahertz processor and 24 gigabytes of RAM."
Software Dependencies | Yes | "Implementation was in MATLAB 2012b in Ubuntu 12.04 on hardware with an Intel Xeon 3.47 gigahertz processor and 24 gigabytes of RAM."
Experiment Setup | Yes | "Parameter values were bandwidth h = .2, exploration parameter ϵ = .9, discount factor γ = .9, and k = 20 successful episodes to initialize with."
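The reported settings, and the ϵ-greedy action choice they parameterize, can be collected in a short sketch; the `PARAMS` dict and the `epsilon_greedy` helper are illustrative names introduced here, and `q` is a stand-in for any callable Q-estimate.

```python
import random

# Settings reported in the paper's Mountain Car experiment
PARAMS = {"h": 0.2, "epsilon": 0.9, "gamma": 0.9, "init_episodes": 20}

def epsilon_greedy(q, state, actions, epsilon=PARAMS["epsilon"]):
    """With probability epsilon take a random action, else the greedy one.

    q is any callable q(state, action), e.g. a kernel-smoothed Q-estimate.
    """
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q(state, a))
```

With ϵ = .9 the agent explores 90% of the time, which matches the algorithm's reliance on densely sampling the state-action space for the kernel estimate.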