The Mean-Squared Error of Double Q-Learning

Authors: Wentao Weng, Harsh Gupta, Niao He, Lei Ying, R. Srikant

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also present some practical implications of this theoretical observation using simulations. In this paper, we focus on comparing Double Q-learning with standard Q-learning, both theoretically and experimentally. Section 4, Numerical Results, details simulations on various environments.
Researcher Affiliation | Academia | Wentao Weng (Tsinghua University, wwt17@mails.tsinghua.edu.cn); Harsh Gupta (University of Illinois at Urbana-Champaign, hgupta10@illinois.edu); Niao He (University of Illinois at Urbana-Champaign, niaohe@illinois.edu); Lei Ying (University of Michigan, Ann Arbor, leiying@umich.edu); R. Srikant (University of Illinois at Urbana-Champaign, rsrikant@illinois.edu)
Pseudocode | No | The paper describes the algorithms using mathematical equations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Codes are at https://github.com/wentaoweng/The-Mean-Squared-Error-of-Double-Q-Learning.
Open Datasets | Yes | We train algorithms on Cart Pole-v0 available in Open AI Gym [9].
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits. For Cart Pole, it mentions evaluation of policies but not a distinct validation split for hyperparameter tuning.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software such as "Open AI Gym" and methods such as stochastic gradient descent, but does not provide specific version numbers for any libraries or frameworks used in the implementation.
Experiment Setup | Yes | We set the step size α_n = 1000/(n + 10000). The optimal estimator, θ*, is calculated by solving the projected Bellman equation [25] based on the Markov chain. Sample paths start in state 1 in Baird's Example, and state (1, 1) in Grid World. We use the uniformly random policy as the behavioral policy, i.e., each valid action is taken with equal probability in any given state. Initializations of θ_1, θ^A_1, and θ^B_1 are set the same and are uniformly sampled from [0, 2]^d, where d is the dimension of the features. ... For the n-th episode, we use ϵ_n = max(0.1, min(1, 1 − log(n/200))) and α_n = 40/(n + 100). The discount factor is set as γ = 0.999.
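
For context on the comparison flagged under Research Type, the two update rules at issue can be written out directly. The following is a minimal Python sketch of the standard tabular forms (the paper's analysis also covers linear function approximation); the function and variable names are ours and are not taken from the authors' repository.

    import numpy as np

    def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
        # Standard Q-learning: bootstrap from the max of the same table.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

    def double_q_learning_step(QA, QB, s, a, r, s_next, alpha, gamma, rng):
        # Double Q-learning: update one randomly chosen table, selecting the
        # greedy action with that table but evaluating it with the other.
        # rng is any generator exposing random(), e.g. np.random.default_rng().
        if rng.random() < 0.5:
            a_star = np.argmax(QA[s_next])
            QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])
        else:
            b_star = np.argmax(QB[s_next])
            QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])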
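
The schedules quoted in the Experiment Setup row can likewise be sketched as code. This is an illustration only, assuming the episode index n starts at 1 and that the operator lost in the extraction of the ϵ_n formula was a minus sign; nothing here is taken from the authors' code.

    import math

    GAMMA = 0.999  # discount factor reported in the paper

    def alpha_linear(n):
        # Step size reported for the linear-approximation examples
        # (Baird's Example, Grid World): alpha_n = 1000 / (n + 10000).
        return 1000.0 / (n + 10000.0)

    def alpha_cartpole(n):
        # Step size reported for the Cart Pole-v0 runs: alpha_n = 40 / (n + 100).
        return 40.0 / (n + 100.0)

    def epsilon_cartpole(n):
        # Exploration rate for the n-th episode, read as
        # eps_n = max(0.1, min(1, 1 - log(n / 200))); requires n >= 1.
        return max(0.1, min(1.0, 1.0 - math.log(n / 200.0)))

With these definitions, epsilon_cartpole stays at 1 for early episodes and decays to the 0.1 floor after a few hundred episodes, while both step-size schedules decrease harmonically in n.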