Episodic Memory Deep Q-Networks

Authors: Zichuan Lin, Tianqi Zhao, Guangwen Yang, Lintao Zhang

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated EMDQN on the benchmark suite of 57 Atari 2600 games from the arcade learning environment [Bellemare et al., 2013].
Researcher Affiliation | Collaboration | Zichuan Lin (1,3), Tianqi Zhao (2), Guangwen Yang (1), Lintao Zhang (3); 1: Tsinghua University, 2: Microsoft, 3: Microsoft Research
Pseudocode | No | The paper describes the algorithm steps in paragraph form and equations but does not provide a formal pseudocode block or an algorithm box.
Open Source Code | No | No explicit statement regarding the release of source code or a link to a code repository was found.
Open Datasets | Yes | We evaluated EMDQN on the benchmark suite of 57 Atari 2600 games from the arcade learning environment [Bellemare et al., 2013].
Dataset Splits | No | The paper mentions training and testing but does not explicitly provide details for a separate validation dataset split.
Hardware Specification | No | No specific hardware details such as GPU or CPU models, memory, or detailed computer specifications used for running experiments were provided.
Software Dependencies | No | No specific software dependencies, libraries, or solvers with version numbers were mentioned.
Experiment Setup | Yes | EMDQN follows all of the network and hyper-parameter settings of DQN as presented in [Mnih et al., 2015]. Rewards are clipped to [-1, 1] when computing the true discounted return R_t. The coefficient λ was tuned by comparing the values {0.01, 0.05, 0.1, 0.2, 0.5, 1.0} on the games Alien, Atlantis, Beamrider, Gopher, and Zaxxon, but we found that larger values of λ deteriorate performance; we therefore fix λ at 0.1 to regularize the Q value during training. For more efficient table lookup, we use a random projection technique and project the states into vectors of dimension dim_h = 4. Specifically, we generate a matrix with values drawn from the distribution N(0, 1/dim_h) and fix the matrix during training. Our state buffer size is set to 5 million for each action, and the least recently updated state is substituted when the buffer is full. The memory table is updated every 10,000 training steps. We clip the gradients of (Q_θ(s_i, a_i) - S(s_i, a_i))^2 and (Q_θ(s_i, a_i) - H(s_i, a_i))^2 in Eq. (6) to [-1, 1], respectively.
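
To make the quoted experiment setup concrete, here is a minimal sketch of the random-projection lookup table it describes. The class name, the exact-match keying, and the per-entry timestamp used for eviction are illustrative assumptions; only dim_h = 4, the fixed N(0, 1/dim_h) projection matrix, the 5-million-entry-per-action capacity, and least-recently-updated replacement come from the text above.

```python
# Sketch of a random-projection episodic memory table (assumed structure).
import numpy as np


class EpisodicMemory:
    def __init__(self, state_dim, num_actions, dim_h=4, capacity=5_000_000, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection; N(0, 1/dim_h) is read as (mean, variance),
        # so the standard deviation is sqrt(1/dim_h).
        self.proj = rng.normal(0.0, np.sqrt(1.0 / dim_h), size=(dim_h, state_dim))
        self.capacity = capacity
        # One table per action: projected-state key -> (best return, last update step).
        self.tables = [dict() for _ in range(num_actions)]

    def _key(self, state):
        # Project the flattened state to a dim_h-dimensional vector and use it
        # as a hashable key (exact-match lookup is an assumption here).
        h = self.proj @ np.asarray(state, dtype=np.float32).ravel()
        return tuple(np.round(h, 4))

    def update(self, state, action, discounted_return, step):
        """Record the best discounted return R_t observed for (state, action)."""
        table = self.tables[action]
        key = self._key(state)
        if key in table:
            best, _ = table[key]
            table[key] = (max(best, discounted_return), step)
            return
        if len(table) >= self.capacity:
            # Evict the least recently updated entry when the buffer is full.
            # (A linear scan is used for brevity; a real implementation would
            # keep a priority structure.)
            oldest = min(table, key=lambda k: table[k][1])
            del table[oldest]
        table[key] = (discounted_return, step)

    def lookup(self, state, action):
        """Return the stored episodic value H(s, a), or None if unseen."""
        entry = self.tables[action].get(self._key(state))
        return None if entry is None else entry[0]
```

The text also notes that the memory table is refreshed every 10,000 training steps; how that refresh interacts with per-transition updates is not specified in the quoted setup.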
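
The quoted setup refers to the two squared-error terms of Eq. (6) without reproducing the equation itself, so the following is only a sketch of how the λ-weighted memory term and the per-term error clipping to [-1, 1] might be combined. Treating S(s_i, a_i) as the usual bootstrapped DQN target, treating H(s_i, a_i) as the return retrieved from the episodic memory, and masking the memory term when no entry exists are assumptions not confirmed by the text above.

```python
# Hedged sketch of a DQN loss with an episodic-memory regularizer.  The
# function name, the mask argument, and the exact meaning of S and H are
# assumptions; lambda = 0.1 and the [-1, 1] error clipping come from the text.
import torch
import torch.nn.functional as F


def emdqn_style_loss(q_pred, td_target, mem_target, mem_mask, lam=0.1):
    """q_pred     -- Q_theta(s_i, a_i) from the online network
    td_target  -- S(s_i, a_i), assumed to be the one-step bootstrapped target
    mem_target -- H(s_i, a_i), assumed to be the best stored return
                  (zero-filled where no memory entry exists)
    mem_mask   -- 1.0 where a memory entry exists, else 0.0
    lam        -- 0.1, the value fixed after the sweep over {0.01, ..., 1.0}
    """
    # Clipping the error of each squared term to [-1, 1] is implemented as a
    # smooth-L1 (Huber) loss, which produces the same clipped gradient.
    td_term = F.smooth_l1_loss(q_pred, td_target.detach())
    mem_err = F.smooth_l1_loss(q_pred, mem_target.detach(), reduction="none")
    mem_term = (mem_mask * mem_err).mean()
    return td_term + lam * mem_term
```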