Episodic Memory Deep Q-Networks

Authors: Zichuan Lin, Tianqi Zhao, Guangwen Yang, Lintao Zhang

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated EMDQN on the benchmark suite of 57 Atari 2600 games from the arcade learning environment [Bellemare et al., 2013].
Researcher Affiliation | Collaboration | Zichuan Lin (1,3), Tianqi Zhao (2), Guangwen Yang (1), Lintao Zhang (3); 1: Tsinghua University, 2: Microsoft, 3: Microsoft Research
Pseudocode | No | The paper describes the algorithm steps in paragraph form and equations but does not provide a formal pseudocode block or an algorithm box.
Open Source Code | No | No explicit statement regarding the release of source code or a link to a code repository was found.
Open Datasets | Yes | We evaluated EMDQN on the benchmark suite of 57 Atari 2600 games from the arcade learning environment [Bellemare et al., 2013].
Dataset Splits | No | The paper mentions training and testing but does not explicitly provide details for a separate validation dataset split.
Hardware Specification | No | No specific hardware details such as GPU or CPU models, memory, or detailed computer specifications used for running experiments were provided.
Software Dependencies | No | No specific software dependencies, libraries, or solvers with version numbers were mentioned.
Experiment Setup | Yes | EMDQN follows all of the network and hyper-parameter settings of DQN as presented in [Mnih et al., 2015]. Rewards are clipped to [-1, 1] when computing the true discounted return R_t. The coefficient λ was tuned by comparing the values {0.01, 0.05, 0.1, 0.2, 0.5, 1.0} on the games Alien, Atlantis, Beamrider, Gopher, and Zaxxon, but we found that larger values of λ deteriorate performance; we therefore fix λ at 0.1 to regularize the Q value during training. For more efficient table lookup, we use a random projection technique and project the states into vectors of dimension dim_h = 4. Specifically, we generate a matrix with values drawn from the distribution N(0, 1/dim_h) and fix the matrix during training. Our state buffer size is set to 5 million for each action, and the least recently updated state is substituted when the buffer is full. The memory table is updated every 10,000 training steps. We clip the gradients of (Q_θ(s_i, a_i) - S(s_i, a_i))^2 and (Q_θ(s_i, a_i) - H(s_i, a_i))^2 in Eq. (6) to [-1, 1], respectively.
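
To make the quoted experiment setup concrete, here is a minimal sketch of the random-projection lookup table it describes. The class name, the exact-match keying, and the per-entry timestamp used for eviction are illustrative assumptions; only dim_h = 4, the fixed N(0, 1/dim_h) projection matrix, the 5-million-entry-per-action capacity, and least-recently-updated replacement come from the text above.

```python
# Sketch of a random-projection episodic memory table (assumed structure).
import numpy as np


class EpisodicMemory:
    def __init__(self, state_dim, num_actions, dim_h=4, capacity=5_000_000, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection; N(0, 1/dim_h) is read as (mean, variance),
        # so the standard deviation is sqrt(1/dim_h).
        self.proj = rng.normal(0.0, np.sqrt(1.0 / dim_h), size=(dim_h, state_dim))
        self.capacity = capacity
        # One table per action: projected-state key -> (best return, last update step).
        self.tables = [dict() for _ in range(num_actions)]

    def _key(self, state):
        # Project the flattened state to a dim_h-dimensional vector and use it
        # as a hashable key (exact-match lookup is an assumption here).
        h = self.proj @ np.asarray(state, dtype=np.float32).ravel()
        return tuple(np.round(h, 4))

    def update(self, state, action, discounted_return, step):
        """Record the best discounted return R_t observed for (state, action)."""
        table = self.tables[action]
        key = self._key(state)
        if key in table:
            best, _ = table[key]
            table[key] = (max(best, discounted_return), step)
            return
        if len(table) >= self.capacity:
            # Evict the least recently updated entry when the buffer is full.
            # (A linear scan is used for brevity; a real implementation would
            # keep a priority structure.)
            oldest = min(table, key=lambda k: table[k][1])
            del table[oldest]
        table[key] = (discounted_return, step)

    def lookup(self, state, action):
        """Return the stored episodic value H(s, a), or None if unseen."""
        entry = self.tables[action].get(self._key(state))
        return None if entry is None else entry[0]
```

The text also notes that the memory table is refreshed every 10,000 training steps; how that refresh interacts with per-transition updates is not specified in the quoted setup.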
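
The quoted setup refers to the two squared-error terms of Eq. (6) without reproducing the equation itself, so the following is only a sketch of how the λ-weighted memory term and the per-term error clipping to [-1, 1] might be combined. Treating S(s_i, a_i) as the usual bootstrapped DQN target, treating H(s_i, a_i) as the return retrieved from the episodic memory, and masking the memory term when no entry exists are assumptions not confirmed by the text above.

```python
# Hedged sketch of a DQN loss with an episodic-memory regularizer.  The
# function name, the mask argument, and the exact meaning of S and H are
# assumptions; lambda = 0.1 and the [-1, 1] error clipping come from the text.
import torch
import torch.nn.functional as F


def emdqn_style_loss(q_pred, td_target, mem_target, mem_mask, lam=0.1):
    """q_pred     -- Q_theta(s_i, a_i) from the online network
    td_target  -- S(s_i, a_i), assumed to be the one-step bootstrapped target
    mem_target -- H(s_i, a_i), assumed to be the best stored return
                  (zero-filled where no memory entry exists)
    mem_mask   -- 1.0 where a memory entry exists, else 0.0
    lam        -- 0.1, the value fixed after the sweep over {0.01, ..., 1.0}
    """
    # Clipping the error of each squared term to [-1, 1] is implemented as a
    # smooth-L1 (Huber) loss, which produces the same clipped gradient.
    td_term = F.smooth_l1_loss(q_pred, td_target.detach())
    mem_err = F.smooth_l1_loss(q_pred, mem_target.detach(), reduction="none")
    mem_term = (mem_mask * mem_err).mean()
    return td_term + lam * mem_term
```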