The Mean-Squared Error of Double Q-Learning
Authors: Wentao Weng, Harsh Gupta, Niao He, Lei Ying, R. Srikant
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also present some practical implications of this theoretical observation using simulations. In this paper, we focus on comparing Double Q-learning with standard Q-learning, both theoretically and experimentally. Section 4, Numerical Results, details simulations on various environments. |
| Researcher Affiliation | Academia | Wentao Weng, Tsinghua University (wwt17@mails.tsinghua.edu.cn); Harsh Gupta, University of Illinois at Urbana-Champaign (hgupta10@illinois.edu); Niao He, University of Illinois at Urbana-Champaign (niaohe@illinois.edu); Lei Ying, University of Michigan, Ann Arbor (leiying@umich.edu); R. Srikant, University of Illinois at Urbana-Champaign (rsrikant@illinois.edu) |
| Pseudocode | No | The paper describes algorithms using mathematical equations, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/wentaoweng/The-Mean-Squared-Error-of-Double-Q-Learning. |
| Open Datasets | Yes | We train algorithms on CartPole-v0, available in OpenAI Gym [9]. |
| Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits. For CartPole, it mentions evaluation of policies but not a distinct validation split for hyperparameter tuning. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software such as OpenAI Gym and methods such as stochastic gradient descent, but it does not provide version numbers for any of the libraries or frameworks used in the implementation. |
| Experiment Setup | Yes | We set the step size α_n = 1000/(n + 10000). The optimal estimator, θ*, is calculated by solving the projected Bellman equation [25] based on the Markov chain. Sample paths start in state 1 in Baird's Example, and in state (1, 1) in Grid World. We use the uniformly random policy as the behavioral policy, i.e., each valid action is taken with equal probability in any given state. The initializations of θ_1, θ_1^A, and θ_1^B are set the same and are uniformly sampled from [0, 2]^d, where d is the dimension of the features. ... For the nth episode, we use ε_n = max(0.1, min(1, 1 − log(n/200))) and α_n = 40/(n + 100). The discount factor is set as γ = 0.999. |
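
The hyperparameters quoted in the Experiment Setup row translate directly into code. The Python sketch below restates the step-size and exploration schedules, the discount factor, and the shared initialization; the `gym` environment name, the placeholder feature dimension `d`, and the minus sign inferred in the ε-schedule are assumptions made for illustration, not the authors' verified implementation.

```python
import math
import random

import gym  # OpenAI Gym, cited in the paper as the source of CartPole-v0

GAMMA = 0.999  # discount factor reported for the CartPole experiments


def alpha_tabular(n):
    """Step size quoted for the Baird's Example / Grid World runs: 1000 / (n + 10000)."""
    return 1000.0 / (n + 10000.0)


def alpha_cartpole(n):
    """Step size quoted for the n-th CartPole episode: 40 / (n + 100)."""
    return 40.0 / (n + 100.0)


def epsilon_cartpole(n):
    """Exploration rate for the n-th CartPole episode (n >= 1).

    Reconstructed as max(0.1, min(1, 1 - log(n / 200))); the minus sign is
    inferred from the garbled excerpt in the paper text.
    """
    return max(0.1, min(1.0, 1.0 - math.log(n / 200.0)))


# Shared initialization of θ_1, θ_1^A, θ_1^B, sampled uniformly from [0, 2]^d.
# d is a placeholder here; the paper uses the actual feature dimension.
d = 8
theta_init = [random.uniform(0.0, 2.0) for _ in range(d)]

# Environment used in Section 4, Numerical Results.
env = gym.make("CartPole-v0")
```

In a Double Q-learning run under this setup, episode n would use step size `alpha_cartpole(n)` and exploration rate `epsilon_cartpole(n)`, with both estimators starting from the same `theta_init`.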