Maxmin Q-learning: Controlling the Estimation Bias of Q-learning

Authors: Qingfeng Lan, Yangchen Pan, Alona Fyshe, Martha White

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
Researcher Affiliation | Academia | Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada {qlan3,pan6,alona,whitem}@ualberta.ca
Pseudocode | Yes | Algorithm 1: Maxmin Q-learning (a minimal sketch of this update appears after the table)
Open Source Code | Yes | Code is available at https://github.com/qlan3/Explorer
Open Datasets | Yes | Mountain Car (Sutton & Barto, 2018) is a classic testbed in Reinforcement Learning... To evaluate Maxmin DQN, we choose seven games from Gym (Brockman et al., 2016), PyGame Learning Environment (PLE) (Tasfi, 2016), and MinAtar (Young & Tian, 2019): Lunarlander, Catcher, Pixelcopter, Asterix, Seaquest, Breakout, and Space Invaders.
Dataset Splits | No | The paper mentions training episodes and test episodes, but does not give explicit validation splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | No | The paper does not specify the hardware, such as GPU or CPU models, used to run the experiments.
Software Dependencies | No | The paper mentions 'Tile-coding' and 'RMSprop' but does not give version numbers for these or for any other software libraries or dependencies.
Experiment Setup | Yes | In the experiment, we used a discount factor γ = 1; a replay buffer with size 100; an ϵ-greedy behaviour with ϵ = 0.1; tabular action-values, initialized with a Gaussian distribution N(0, 0.01); and a step-size of 0.01 for all algorithms... The discount factor was 0.99. The size of the replay buffer was 10,000. The weights of neural networks were optimized by RMSprop with gradient clip 5. The batch size was 32. The target network was updated every 200 frames. ϵ-greedy was applied as the exploration strategy with ϵ decreasing linearly from 1.0 to 0.01 in 1,000 steps. After 1,000 steps, ϵ was fixed to 0.01. (These settings are restated as a configuration sketch after the table.)