Maxmin Q-learning: Controlling the Estimation Bias of Q-learning

Authors: Qingfeng Lan, Yangchen Pan, Alona Fyshe, Martha White

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
Researcher Affiliation | Academia | Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada {qlan3,pan6,alona,whitem}@ualberta.ca
Pseudocode | Yes | Algorithm 1: Maxmin Q-learning (a minimal sketch of this update appears after the table)
Open Source Code | Yes | Code is available at https://github.com/qlan3/Explorer
Open Datasets | Yes | Mountain Car (Sutton & Barto, 2018) is a classic testbed in Reinforcement Learning... To evaluate Maxmin DQN, we choose seven games from Gym (Brockman et al., 2016), PyGame Learning Environment (PLE) (Tasfi, 2016), and MinAtar (Young & Tian, 2019): Lunarlander, Catcher, Pixelcopter, Asterix, Seaquest, Breakout, and Space Invaders.
Dataset Splits | No | The paper mentions training episodes and test episodes, but does not give explicit validation splits (e.g., percentages or sample counts for training, validation, and test sets).
Hardware Specification | No | The paper does not specify the hardware, such as GPU or CPU models, used to run the experiments.
Software Dependencies | No | The paper mentions 'Tile-coding' and 'RMSprop' but does not give version numbers for these or for any other software libraries or dependencies.
Experiment Setup | Yes | In the experiment, we used a discount factor γ = 1; a replay buffer with size 100; an ϵ-greedy behaviour with ϵ = 0.1; tabular action-values, initialized with a Gaussian distribution N(0, 0.01); and a step-size of 0.01 for all algorithms... The discount factor was 0.99. The size of the replay buffer was 10,000. The weights of neural networks were optimized by RMSprop with gradient clip 5. The batch size was 32. The target network was updated every 200 frames. ϵ-greedy was applied as the exploration strategy with ϵ decreasing linearly from 1.0 to 0.01 in 1,000 steps. After 1,000 steps, ϵ was fixed to 0.01. (These settings are restated as a configuration sketch after the table.)