Weighted Double Q-learning

Authors: Zongzhang Zhang, Zhiyuan Pan, Mykel J. Kochenderfer

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, the new algorithm is shown to perform well on several MDP problems. We present empirical results of estimators of the maximum expected value on three groups of multi-arm bandit problems, and compare Q-learning and its variants in terms of the action-value estimate and policy quality on MDP problems.
Researcher Affiliation | Academia | Zongzhang Zhang (Soochow University, Suzhou, Jiangsu 215006, China; zzzhang@suda.edu.cn), Zhiyuan Pan (Soochow University, Suzhou, Jiangsu 215006, China; owenpzy@gmail.com), Mykel J. Kochenderfer (Stanford University, Stanford, CA 94305, USA; mykel@stanford.edu)
Pseudocode | Yes | Algorithm 1: Q-learning, Algorithm 2: Double Q-learning, Algorithm 3: Weighted Double Q-learning (a hedged sketch of the weighted update appears after this table).
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | No | The paper describes problem setups (multi-arm bandits, grid world, intruder monitoring) from which data is generated or simulated, but does not provide access information (links, citations) for publicly available datasets. For multi-arm bandits, it explicitly states, “generate µ_i from N(E{X_i}, 1) for all i and then generate 1000 samples d_i from N(µ_i, 1).” (A sampling sketch appears after this table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment.
Experiment Setup | Yes | Algorithms with a polynomial learning rate, α_t(s, a) = [1/n_t(s, a)]^m with m = 0.8, were shown to perform better than ones with a linear learning rate, α_t(s, a) = 1/n_t(s, a) [van Hasselt, 2010]. This paper focuses on empirical results of Q-learning, bias-corrected Q-learning, and weighted Q-learning with parameter m = 0.8, and both double Q-learning and weighted double Q-learning with parameters m^U = 0.8 and m^V = 0.8 in the two learning rates α^U_t(s, a) = [1/n^U_t(s, a)]^(m^U) and α^V_t(s, a) = [1/n^V_t(s, a)]^(m^V), where the variables n^U_t(s, a) and n^V_t(s, a) store the number of updates of Q^U(s, a) and Q^V(s, a), respectively. The action-selection strategy in all algorithms was ϵ-greedy with ϵ(s) = 1/n_t(s)^0.5, where n_t(s) is the number of times state s has been visited. For the weighted double estimator method, we report the results with the input parameter c = 1, 10, and 100. In this paper, we set k = 1 because it leads to a much lower estimation error than DE on G2 and G3, as shown in Fig. 2. (A schedule sketch covering the learning rate and ϵ-greedy exploration appears after this table.)
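
To make the pseudocode row concrete, the following is a minimal tabular sketch of one weighted double Q-learning update in Python. It assumes the usual double-Q bookkeeping (two tables QU and QV, with a fair coin deciding which one is updated each step); the weighting β = |Q^V(s', a*) − Q^V(s', a_L)| / (c + |Q^V(s', a*) − Q^V(s', a_L)|) reflects my reading of the paper's Algorithm 3 and is not reproduced in this summary, so treat it as an assumption rather than a verbatim transcription.

```python
import numpy as np

def wdq_update(QU, QV, s, a, r, s_next, alpha, gamma=0.95, c=1.0):
    """One weighted double Q-learning update of QU (sketch).

    QU and QV are |S| x |A| arrays; the caller is assumed to flip a fair
    coin each step to decide whether QU or QV gets updated.
    """
    a_star = int(np.argmax(QU[s_next]))   # greedy action under QU
    a_low = int(np.argmin(QU[s_next]))    # lowest-valued action under QU
    # Weight between the single-estimator target (QU) and the
    # double-estimator target (QV); the form of beta is an assumption.
    gap = abs(QV[s_next, a_star] - QV[s_next, a_low])
    beta = gap / (c + gap)
    target = r + gamma * (beta * QU[s_next, a_star]
                          + (1.0 - beta) * QV[s_next, a_star])
    QU[s, a] += alpha * (target - QU[s, a])
```

The symmetric branch updates QV with the roles of QU and QV swapped. Roughly, a large c pushes the target toward double Q-learning's cross-estimate Q^V(s', a*), while a small c pushes it toward Q-learning's single estimate Q^U(s', a*).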
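
The sampling procedure quoted in the Open Datasets row can be reproduced in a few lines; below is a minimal sketch, assuming the true arm means E{X_i} are supplied by the experimenter. Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def sample_bandit_data(true_means, n_samples=1000, rng=None):
    """For each arm i, draw mu_i ~ N(E{X_i}, 1), then draw n_samples
    observations d_i ~ N(mu_i, 1), as in the quoted setup.
    """
    rng = np.random.default_rng() if rng is None else rng
    true_means = np.asarray(true_means, dtype=float)
    mus = rng.normal(loc=true_means, scale=1.0)       # one mu_i per arm
    samples = rng.normal(loc=mus, scale=1.0,
                         size=(n_samples, mus.size))  # n_samples draws per arm
    return mus, samples
```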
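
Finally, the count-based schedules in the Experiment Setup row translate directly into code. The sketch below implements the polynomial learning rate α_t(s, a) = [1/n_t(s, a)]^m with m = 0.8 and the exploration schedule ϵ(s) = 1/n_t(s)^0.5; the function names and the count arguments (n_sa for the (s, a) update count, n_s for the state visit count) are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def learning_rate(n_sa, m=0.8):
    """Polynomial learning rate alpha_t(s, a) = [1 / n_t(s, a)]^m."""
    return (1.0 / max(n_sa, 1)) ** m

def exploration_rate(n_s):
    """Epsilon schedule eps(s) = 1 / n_t(s)^0.5."""
    return 1.0 / np.sqrt(max(n_s, 1))

def epsilon_greedy_action(q_row, n_s, rng):
    """Random action with probability eps(s), greedy action otherwise."""
    if rng.random() < exploration_rate(n_s):
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```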