Adaptive Order Q-learning

Authors: Tao Tan, Hong Xie, Defu Lian

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We start with tabular MDP experiments to reveal fundamental insights into why Order Q-learning and Ada Order Q-learning can achieve superior performance. We then evaluate the impact of Order DQN and Ada Order DQN in deep reinforcement learning settings.
Researcher Affiliation | Academia | Tao Tan (College of Computer Science, Chongqing University); Hong Xie and Defu Lian (University of Science and Technology of China)
Pseudocode | Yes | Algorithm 1 (Order Q-learning), Algorithm 2 (Order DQN), Algorithm 3 (Ada Order Q-learning), Algorithm 4 (Ada Order DQN)
Open Source Code | Yes | The code of all experiments can be found in link1: https://1drv.ms/u/s!Atddp1iaDmL2ghdcHyYXNO785moD
Open Datasets | Yes | We introduce three tabular MDP environments: (1) the Multi-armed bandit, adapted from [Mannor et al., 2007], which has a single state with ten actions, each yielding a reward drawn from N(0, 1); (2) a simple MDP environment depicted in Figure 1, where µ1 = 0.1, σ1 = 1.0, µ2 = 0.1, σ2 = 1.0; (3) Gridworld [Zhang et al., 2017], which has four actions (up, down, left, and right) in each state. ... To evaluate the impact of Order DQN and Ada Order DQN, we choose three common deep reinforcement learning games from the PyGame Learning Environment [Urtans and Nikitenko, 2018] and MinAtar [Young and Tian, 2019]: Pixelcopter, Breakout, and Asterix. (A minimal sketch of the bandit environment follows the table.)
Dataset Splits | No | For the Pixelcopter environment, we set |D| = 10,000, V = 200, and α = 0.001; ε decreases linearly from 1.0 to 0.01 over 1,000 steps and is then fixed at 0.01. For the Breakout and Asterix environments, we set |D| = 100,000, V = 1,000, and α = 0.01; ε decreases linearly from 1.0 to 0.1 over 100,000 steps and is then fixed at 0.1. The paper does not explicitly specify a validation dataset split, only training parameters. (A sketch of this ε schedule follows the table.)
Hardware Specification | No | The paper does not provide hardware details such as the GPU/CPU models or memory used to run the experiments.
Software Dependencies | No | The paper mentions using the 'PyGame Learning Environment' and 'MinAtar' but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | Following [Hasselt, 2010; Zhu and Rigotti, 2021; Pentaliotis and Wiering, 2021], we set γ = 0.95, α = 1/n(s, a)^0.8, and ε = 1/n(s)^0.5 for the Multi-armed bandit and Gridworld environments, and set γ = 1.0, α = 0.1, and ε = 0.1 for the MDP environment. ... For the Pixelcopter environment, we set |D| = 10,000, V = 200, and α = 0.001; ε decreases linearly from 1.0 to 0.01 over 1,000 steps and is then fixed at 0.01. For the Breakout and Asterix environments, we set |D| = 100,000, V = 1,000, and α = 0.01; ε decreases linearly from 1.0 to 0.1 over 100,000 steps and is then fixed at 0.1.
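
To make the Open Datasets entry concrete, here is a minimal sketch of the bandit testbed it quotes: a single state with ten actions whose rewards are all drawn from N(0, 1), so every action has the same true value of 0. The class name, method names, and the one-step-episode convention are our own assumptions, not taken from the released code.

```python
import numpy as np

class TenArmedBandit:
    """Minimal sketch (not the authors' code) of the bandit testbed quoted above:
    a single state with ten actions whose rewards are all drawn from N(0, 1)."""

    n_actions = 10

    def reset(self):
        # The environment has only one state, indexed 0.
        return 0

    def step(self, action):
        assert 0 <= action < self.n_actions
        reward = np.random.normal(0.0, 1.0)  # reward ~ N(0, 1) for every action
        done = True                          # each pull treated as a one-step episode (our assumption)
        return 0, reward, done
```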
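The ε schedule quoted in the Dataset Splits and Experiment Setup entries is a linear decay followed by a floor. A minimal sketch, assuming the step counter is the global environment-step count (the paper's quoted text does not say which counter is used):

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=1_000):
    """Linear exploration schedule matching the Pixelcopter setting quoted above:
    epsilon decreases from 1.0 to 0.01 over 1,000 steps and is then held at 0.01.
    For Breakout and Asterix, use eps_end=0.1 and decay_steps=100_000 instead."""
    if step >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * (step / decay_steps)
```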
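For the tabular settings, the Experiment Setup entry gives count-based schedules α = 1/n(s, a)^0.8 and ε = 1/n(s)^0.5 with γ = 0.95. The sketch below plugs those schedules into a plain tabular Q-learning loop; it is only the shared scaffolding, not the Order or Ada Order update rules of Algorithms 1 and 3, and the environment interface is assumed to match the bandit sketch above.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=1_000, gamma=0.95):
    """Plain tabular Q-learning baseline using the count-based schedules quoted
    above: alpha = 1/n(s, a)^0.8 and epsilon = 1/n(s)^0.5, with gamma = 0.95 as
    in the bandit and Gridworld settings. `env` is assumed to expose
    reset() -> s and step(a) -> (s', r, done)."""
    Q = np.zeros((n_states, n_actions))
    n_sa = np.zeros((n_states, n_actions))  # visit counts n(s, a)
    n_s = np.zeros(n_states)                # visit counts n(s)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            n_s[s] += 1
            eps = 1.0 / np.sqrt(n_s[s])           # epsilon = 1 / n(s)^0.5
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)  # explore
            else:
                a = int(np.argmax(Q[s]))          # exploit
            s_next, r, done = env.step(a)
            n_sa[s, a] += 1
            alpha = 1.0 / (n_sa[s, a] ** 0.8)     # alpha = 1 / n(s, a)^0.8
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

For example, `tabular_q_learning(TenArmedBandit(), n_states=1, n_actions=10)` runs this baseline on the bandit sketch above.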