Maxmin Q-learning: Controlling the Estimation Bias of Q-learning
Authors: Qingfeng Lan, Yangchen Pan, Alona Fyshe, Martha White
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems. |
| Researcher Affiliation | Academia | Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada {qlan3,pan6,alona,whitem}@ualberta.ca |
| Pseudocode | Yes (a hedged sketch follows the table) | Algorithm 1: Maxmin Q-learning |
| Open Source Code | Yes | Code is available at https://github.com/qlan3/Explorer |
| Open Datasets | Yes | Mountain Car (Sutton & Barto, 2018) is a classic testbed in Reinforcement Learning... To evaluate Maxmin DQN, we choose seven games from Gym (Brockman et al., 2016), PyGame Learning Environment (PLE) (Tasfi, 2016), and MinAtar (Young & Tian, 2019): Lunarlander, Catcher, Pixelcopter, Asterix, Seaquest, Breakout, and Space Invaders. |
| Dataset Splits | No | The paper mentions training episodes and test episodes, but does not provide explicit information about validation splits (e.g., specific percentages or sample counts for training, validation, and test sets of a dataset). |
| Hardware Specification | No | The paper does not specify any particular hardware, such as GPU or CPU models, used for running the experiments. |
| Software Dependencies | No | The paper mentions 'Tile-coding' and 'RMSprop' but does not specify version numbers for these or other software libraries/dependencies. |
| Experiment Setup | Yes | In the experiment, we used a discount factor γ = 1; a replay buffer with size 100; an ϵ-greedy behaviour with ϵ = 0.1; tabular action-values, initialized with a Gaussian distribution N(0, 0.01); and a step-size of 0.01 for all algorithms... The discount factor was 0.99. The size of the replay buffer was 10,000. The weights of neural networks were optimized by RMSprop with gradient clip 5. The batch size was 32. The target network was updated every 200 frames. ϵ-greedy was applied as the exploration strategy with ϵ decreasing linearly from 1.0 to 0.01 in 1,000 steps. After 1,000 steps, ϵ was fixed to 0.01. |
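
The Pseudocode row refers to Algorithm 1 (Maxmin Q-learning): maintain N action-value estimates, act with respect to their elementwise minimum, and bootstrap the target from that minimum. The tabular sketch below is a minimal illustration, not the authors' released code: the replay buffer and the subset-selection step of Algorithm 1 are omitted, a single randomly chosen estimate is updated per transition, and all function and parameter names are ours. The defaults mirror the toy-experiment settings quoted above (γ = 1, step-size 0.01, ϵ = 0.1).

```python
import numpy as np


def epsilon_greedy_action(q_tables, s, epsilon=0.1, rng=np.random):
    """Act epsilon-greedily with respect to Q^min(s, .) = min_i Q_i(s, .)."""
    n_actions = q_tables[0].shape[1]
    if rng.random() < epsilon:
        return int(rng.randint(n_actions))
    q_min = np.min([q[s] for q in q_tables], axis=0)
    return int(np.argmax(q_min))


def maxmin_q_update(q_tables, s, a, r, s_next, done,
                    gamma=1.0, step_size=0.01, rng=np.random):
    """One tabular Maxmin Q-learning update on a single transition.

    q_tables: list of N arrays with shape [n_states, n_actions],
              one array per action-value estimate.
    """
    # Maxmin target: bootstrap from the elementwise minimum over the N estimates.
    q_min_next = np.min([q[s_next] for q in q_tables], axis=0)
    target = r if done else r + gamma * np.max(q_min_next)

    # Update a single randomly chosen estimate toward the maxmin target.
    i = rng.randint(len(q_tables))
    q_tables[i][s, a] += step_size * (target - q_tables[i][s, a])
```

For example, initializing with `q_tables = [np.random.normal(0.0, 0.01, size=(n_states, n_actions)) for _ in range(N)]` matches the Gaussian N(0, 0.01) initialization quoted above; with N = 1 the update reduces to standard Q-learning, and larger N pushes the maxmin target toward underestimation, which is how the estimation bias is controlled.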
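
Similarly, the benchmark (Maxmin DQN) hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. The key names below are illustrative assumptions and are not taken from the authors' repository; only the values are the ones reported in the paper.

```python
# Assumed key names; values are those quoted from the paper's benchmark
# (deep network) experiments.
benchmark_config = {
    "discount": 0.99,
    "replay_buffer_size": 10_000,
    "optimizer": "RMSprop",
    "gradient_clip": 5,
    "batch_size": 32,
    "target_update_frequency": 200,  # frames between target-network updates
    "epsilon_start": 1.0,
    "epsilon_end": 0.01,             # fixed at 0.01 after the decay
    "epsilon_decay_steps": 1_000,    # linear decay
}
```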