Weighted Double Q-learning
Authors: Zongzhang Zhang, Zhiyuan Pan, Mykel J. Kochenderfer
IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, the new algorithm is shown to perform well on several MDP problems. We present empirical results of estimators of the maximum expected value on three groups of multi-arm bandit problems, and compare Q-learning and its variants in terms of the action-value estimate and policy quality on MDP problems. |
| Researcher Affiliation | Academia | Zongzhang Zhang, Soochow University, Suzhou, Jiangsu 215006, China (zzzhang@suda.edu.cn); Zhiyuan Pan, Soochow University, Suzhou, Jiangsu 215006, China (owenpzy@gmail.com); Mykel J. Kochenderfer, Stanford University, Stanford, CA 94305, USA (mykel@stanford.edu) |
| Pseudocode | Yes | Algorithm 1 Q-learning, Algorithm 2 Double Q-learning, Algorithm 3 Weighted Double Q-learning |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | No | The paper describes problem setups (multi-arm bandits, grid world, intruder monitoring) from which data is generated or simulated, but does not provide access information (links, citations) for publicly available datasets. For multi-arm bandits, it explicitly states, “generate µ_i from N(E{X_i}, 1) for all i and then generate 1000 samples d_i from N(µ_i, 1).” (A sketch of this sampling procedure appears after the table.) |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment. |
| Experiment Setup | Yes | Algorithms with a polynomial learning rate, α_t(s, a) = [1/n_t(s, a)]^m with m = 0.8, were shown to have better performance than ones with a linear learning rate, α_t(s, a) = 1/n_t(s, a) [van Hasselt, 2010]. This paper focuses on empirical results of Q-learning, bias-corrected Q-learning, and weighted Q-learning with parameter m = 0.8, and of both double Q-learning and weighted double Q-learning with parameters m^U = 0.8 and m^V = 0.8 in the two learning rates α_t^U(s, a) = [1/n_t^U(s, a)]^(m^U) and α_t^V(s, a) = [1/n_t^V(s, a)]^(m^V), where the variables n_t^U(s, a) and n_t^V(s, a) store the number of updates to Q^U(s, a) and Q^V(s, a), respectively. The action-selection strategy in all algorithms was ε-greedy with ε(s) = 1/n_t(s)^0.5, where n_t(s) is the number of times state s has been visited. For the weighted double estimator method, we report the results with the input parameter c = 1, 10, 100. In this paper, we set k = 1 because it leads to a much lower estimation error than DE on G2 and G3, as shown in Fig. 2. (Sketches of the learning-rate and exploration schedules, and of the weighted bootstrap target, appear after the table.) |
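
The bandit data generation quoted in the Open Datasets row can be reproduced from the description alone. The sketch below is an assumption about how that procedure would be coded (it is not the authors' code, and the arm count and true means in the example call are illustrative choices): for each arm i, a mean µ_i is drawn from N(E{X_i}, 1), and then 1000 reward samples d_i are drawn from N(µ_i, 1).

```python
import numpy as np

def sample_bandit_data(true_means, n_samples=1000, rng=None):
    """Bandit data generation as quoted from the paper: for each arm i,
    draw mu_i ~ N(E{X_i}, 1), then draw n_samples rewards d_i ~ N(mu_i, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    true_means = np.asarray(true_means, dtype=float)
    mu = rng.normal(loc=true_means, scale=1.0)        # mu_i ~ N(E{X_i}, 1)
    samples = rng.normal(loc=mu, scale=1.0,           # d_i ~ N(mu_i, 1), one column per arm
                         size=(n_samples, len(true_means)))
    return mu, samples

# Illustrative call: 10 arms whose true expected values are all 0
# (the arm count and means are assumptions, not taken from the paper).
mu, data = sample_bandit_data(np.zeros(10))
```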
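The learning-rate and exploration schedules in the Experiment Setup row are simple enough to state directly. The following is a minimal sketch, assuming count tables n_t(s, a) and n_t(s) are maintained by the caller; the function names are hypothetical, not from the paper.

```python
import numpy as np

def polynomial_learning_rate(n_sa, m=0.8):
    """alpha_t(s, a) = [1 / n_t(s, a)]^m, with m = 0.8 as in the experiments."""
    return (1.0 / max(n_sa, 1)) ** m

def epsilon_greedy_action(q_values, n_s, rng=None):
    """epsilon-greedy with epsilon(s) = 1 / n_t(s)^0.5 (count-based decay)."""
    rng = np.random.default_rng() if rng is None else rng
    epsilon = 1.0 / np.sqrt(max(n_s, 1))
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: uniform random action
    return int(np.argmax(q_values))               # exploit: greedy action
```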
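For context on the role of the parameter c, here is a rough sketch of how the weighted double estimator's bootstrap target could be formed. The specific weight formula below is my assumption about the paper's construction and is not confirmed by this entry; what the entry does state is that c is the tunable input of the weighted double estimator. The sketch interpolates between the single-estimator value Q^U(s', a*) and the double-estimator value Q^V(s', a*), with larger c moving the target toward double Q-learning.

```python
def weighted_double_target(q_u, q_v, s_next, reward, gamma, c=1.0):
    """Sketch of a weighted double estimator target for updating Q^U.

    Assumed form (an assumption, not quoted from the paper): with
    a* = argmax_a Q^U(s', a) and a_L = argmin_a Q^U(s', a),
        beta = |Q^V(s', a*) - Q^V(s', a_L)| / (c + |Q^V(s', a*) - Q^V(s', a_L)|),
        target = r + gamma * [beta * Q^U(s', a*) + (1 - beta) * Q^V(s', a*)].
    Larger c pushes beta toward 0 (double Q-learning-like); smaller c pushes
    beta toward 1 (single-estimator Q-learning-like).
    q_u, q_v: dicts mapping state -> {action: value} (a hypothetical layout).
    """
    actions = list(q_u[s_next].keys())
    a_star = max(actions, key=lambda a: q_u[s_next][a])
    a_low = min(actions, key=lambda a: q_u[s_next][a])
    gap = abs(q_v[s_next][a_star] - q_v[s_next][a_low])
    beta = gap / (c + gap)
    return reward + gamma * (beta * q_u[s_next][a_star]
                             + (1.0 - beta) * q_v[s_next][a_star])
```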