Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Non-Asymptotic Guarantees for Average-Reward Q-Learning with Adaptive Stepsizes

Authors: Zaiwei Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical This work presents the first finite-time analysis of average-reward Q-learning with an asynchronous implementation. A key feature of the algorithm we study is the use of adaptive stepsizes that act as local clocks for each state-action pair. We show that the mean-square error of this Q-learning algorithm, measured in the span seminorm, converges at a rate of O(1/k). To establish this result, we demonstrate that adaptive stepsizes are necessary: without them, the algorithm fails to converge to the correct target. Moreover, adaptive stepsizes can be viewed as a form of implicit importance sampling that counteracts the effect of asynchronous updates. Technically, the use of adaptive stepsizes causes each Q-learning update to depend on the full sample history, introducing strong correlations and making the algorithm a non-Markovian stochastic approximation (SA) scheme. Our approach to overcoming this challenge involves (1) a time-inhomogeneous Markovian reformulation of non-Markovian SA, and (2) a combination of almost-sure time-varying bounds, conditioning arguments, and Markov chain concentration inequalities to break the strong correlations between the adaptive stepsizes and the iterates.
Researcher Affiliation Academia Zaiwei Chen School of Industrial Engineering, Purdue University EMAIL
Pseudocode Yes Algorithm 1 Q-Learning 1: Input: Initializations Q1 = 0 R|S||A|, S1 S, and a behavior policy π. 2: for k = 1, 2, , do 3: Take Ak π( | Sk), observe Sk+1 p( | Sk, Ak), and receive R(Sk, Ak). 4: Compute the temporal difference: δk = R(Sk, Ak) + maxa Qk(Sk+1, a ) Qk(Sk, Ak). 5: Update the Q-function: Qk+1(s, a) = Qk(s, a) + αk(s, a)1{(s,a)=(Sk,Ak)}δk for all (s, a), where αk(s, a) = α/(Nk(s, a)+h), and Nk(s, a) := Pk i=1 1{(Si,Ai)=(s,a)} denotes the total number of visits to the state-action pair (s, a) up to the k-th iteration. 6: end for 7: Output: {Qk}k 1
Open Source Code No The paper is a theoretical work and does not mention releasing any source code. The 'Open access to data and code' question in the NeurIPS checklist is marked as NA with the justification 'This is a theoretical work.'.
Open Datasets No The paper is a theoretical work and does not involve experiments on datasets. The 'Open access to data and code' question in the NeurIPS checklist is marked as NA with the justification 'This is a theoretical work.'.
Dataset Splits No The paper is a theoretical work and does not describe any experimental evaluation using datasets, therefore, there is no information on dataset splits.
Hardware Specification No The paper is a theoretical work focused on mathematical analysis and proofs. There is no mention of hardware specifications because no empirical experiments were conducted. The 'Experiments compute resources' question in the NeurIPS checklist is marked as NA with the justification 'This is a theoretical work.'.
Software Dependencies No The paper is a theoretical work and does not describe any software dependencies with specific version numbers. The 'Experimental setting/details' question in the NeurIPS checklist is marked as NA with the justification 'This is a theoretical work.'.
Experiment Setup No The paper is a theoretical work and focuses on the mathematical analysis of an algorithm. It does not describe an experimental setup with hyperparameters or training settings. The 'Experimental setting/details' question in the NeurIPS checklist is marked as NA with the justification 'This is a theoretical work.'.