Adaptive Order Q-learning
Authors: Tao Tan, Hong Xie, Defu Lian
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We start with tabular MDP experiments to reveal fundamental insights into why Order Q-learning and Ada Order Q-learning can achieve superior performance. We then evaluate the impact of Order DQN and Ada Order DQN in deep reinforcement learning settings. |
| Researcher Affiliation | Academia | Tao Tan¹, Hong Xie², Defu Lian² (¹College of Computer Science, Chongqing University; ²University of Science and Technology of China) |
| Pseudocode | Yes | Algorithm 1 Order Q-learning, Algorithm 2 Order DQN, Algorithm 3 Ada Order Q-learning, Algorithm 4 Ada Order DQN |
| Open Source Code | Yes | The code of all experiments can be found in link1. https://1drv.ms/u/s!Atddp1iaDmL2ghdcHyYXNO785moD |
| Open Datasets | Yes | We introduce three tabular MDP environments: (1) Multi-armed bandit is adapted from [Mannor et al., 2007], which considers a single state with ten actions, where the reward of each action obeys the distribution N(0, 1); (2) A simple MDP environment is depicted in Figure 1, where µ1 = 0.1, σ1 = 1.0, µ2 = 0.1, σ2 = 1.0; (3) Gridworld [Zhang et al., 2017] has four actions, i.e., up, down, left, and right for each state. ... To evaluate the impact of Order DQN and Ada Order DQN, we choose three common deep reinforcement learning games from the PyGame Learning Environment [Urtans and Nikitenko, 2018] and MinAtar [Young and Tian, 2019]: Pixelcopter, Breakout, and Asterix. |
| Dataset Splits | No | For the Pixelcopter environment, we set |D| = 10,000, V = 200, and α = 0.001. ε decreases linearly from 1.0 to 0.01 over 1,000 steps and is fixed at 0.01 thereafter. For the Breakout and Asterix environments, we set |D| = 100,000, V = 1,000, and α = 0.01. ε decreases linearly from 1.0 to 0.1 over 100,000 steps and is fixed at 0.1 thereafter. The paper does not explicitly specify a validation dataset split, only training parameters. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using the 'PyGame Learning Environment' and 'MinAtar' but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | Following [Hasselt, 2010; Zhu and Rigotti, 2021; Pentaliotis and Wiering, 2021], we set γ = 0.95, α = 1/n(s,a)^0.8, and ε = 1/n(s)^0.5 for the Multi-armed bandit and Gridworld environments; and set γ = 1.0, α = 0.1, and ε = 0.1 for the MDP environment. ... For the Pixelcopter environment, we set |D| = 10,000, V = 200, and α = 0.001. ε decreases linearly from 1.0 to 0.01 over 1,000 steps and is fixed at 0.01 thereafter. For the Breakout and Asterix environments, we set |D| = 100,000, V = 1,000, and α = 0.01. ε decreases linearly from 1.0 to 0.1 over 100,000 steps and is fixed at 0.1 thereafter. (A schedule sketch follows the table.) |
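
To make the quoted hyperparameter schedules concrete, here is a minimal Python sketch of the schedules described in the Dataset Splits and Experiment Setup rows: the count-based schedules α = 1/n(s,a)^0.8 and ε = 1/n(s)^0.5 for the tabular environments, and the linear ε annealing for the deep RL environments. The function names, visit-count dictionaries, and the point at which counts are incremented are our own illustrative assumptions, not identifiers or details from the authors' released code.

```python
from collections import defaultdict

# Visit counts; assumed to be incremented on every visit to (s, a) and s
# before the schedules are queried, so the denominators are never zero.
n_sa = defaultdict(int)
n_s = defaultdict(int)

def tabular_alpha(s, a):
    """Learning rate alpha = 1 / n(s, a)^0.8 (Multi-armed bandit, Gridworld)."""
    return 1.0 / max(n_sa[(s, a)], 1) ** 0.8

def tabular_epsilon(s):
    """Exploration rate epsilon = 1 / n(s)^0.5 (Multi-armed bandit, Gridworld)."""
    return 1.0 / max(n_s[s], 1) ** 0.5

def linear_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=1_000):
    """Linear annealing used in the DQN experiments: epsilon decreases from
    eps_start to eps_end over decay_steps environment steps and then stays
    at eps_end. Defaults match the quoted Pixelcopter setting; Breakout and
    Asterix would use eps_end=0.1 and decay_steps=100_000."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Example: the Pixelcopter schedule at a few steps (1.0, 0.505, 0.01, 0.01).
if __name__ == "__main__":
    for step in (0, 500, 1_000, 5_000):
        print(step, round(linear_epsilon(step), 3))
```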