Episodic Memory Deep Q-Networks
Authors: Zichuan Lin, Tianqi Zhao, Guangwen Yang, Lintao Zhang
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated EMDQN on the benchmark suite of 57 Atari 2600 games from the arcade learning environment [Bellemare et al., 2013]. |
| Researcher Affiliation | Collaboration | Zichuan Lin (1,3), Tianqi Zhao (2), Guangwen Yang (1), Lintao Zhang (3); 1 Tsinghua University, 2 Microsoft, 3 Microsoft Research |
| Pseudocode | No | The paper describes the algorithm steps in paragraph form and equations but does not provide a formal pseudocode block or an algorithm box. |
| Open Source Code | No | No explicit statement regarding the release of source code or a link to a code repository was found. |
| Open Datasets | Yes | We evaluated EMDQN on the benchmark suite of 57 Atari 2600 games from the arcade learning environment [Bellemare et al., 2013]. |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly provide details for a separate validation dataset split. |
| Hardware Specification | No | No specific hardware details (GPU or CPU models, memory, or other machine specifications) used for running the experiments were provided. |
| Software Dependencies | No | No specific software dependencies, libraries, or solvers with version numbers were mentioned. |
| Experiment Setup | Yes | EMDQN follows all of the network and hyper-parameter settings of DQN as presented in [Mnih et al., 2015]. Rewards are clipped to [-1, 1] when computing the true discounted return R_t. The coefficient λ was tuned by comparing values of {0.01, 0.05, 0.1, 0.2, 0.5, 1.0} on the games Alien, Atlantis, Beamrider, Gopher, and Zaxxon, but we found that larger values of λ deteriorate performance. Therefore, we fix λ at 0.1 to regularize the Q value during training. For more efficient table lookup, we use a random projection technique and project the states into vectors of dimension dim_h = 4. Specifically, we generate a matrix with values drawn from the distribution N(0, 1/dim_h) and fix the matrix during training. Our state buffer size is set to 5 million for each action, and the least recently updated state is substituted when the buffer is full. The memory table is updated every 10,000 training steps. We clip the gradients of (Q_θ(s_i, a_i) - S(s_i, a_i))² and (Q_θ(s_i, a_i) - H(s_i, a_i))² in Eq. (6) to [-1, 1], respectively. |
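
The experiment-setup row describes a fixed random projection (dim_h = 4, matrix drawn from N(0, 1/dim_h)) used to key a per-action episodic memory table. The following is a minimal sketch of that lookup, assuming flattened 84x84x4 Atari frame stacks; the function names, the rounding used to form a hash key, and the reading of N(0, 1/dim_h) as variance 1/dim_h are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

DIM_H = 4                # projected key dimension (dim_h = 4 in the paper)
STATE_DIM = 84 * 84 * 4  # assumed flattened state size for an Atari frame stack

rng = np.random.default_rng(0)
# Projection matrix drawn once and kept fixed during training.
PROJECTION = rng.normal(0.0, np.sqrt(1.0 / DIM_H), size=(STATE_DIM, DIM_H))

# One table per action; each maps a projected key to the best return seen so far.
# (The paper additionally caps each buffer at 5 million states per action and
# evicts the least recently updated entry; that bookkeeping is omitted here.)
memory = [dict() for _ in range(18)]  # 18 = full Atari action set (assumption)

def project_state(state: np.ndarray) -> tuple:
    """Project a state to a low-dimensional key for table lookup."""
    key = state.reshape(-1).astype(np.float32) @ PROJECTION
    return tuple(np.round(key, 3))  # rounding granularity is an assumption

def update_memory(state: np.ndarray, action: int, episodic_return: float) -> None:
    """Keep the best discounted return observed for this (state, action) pair."""
    key = project_state(state)
    table = memory[action]
    table[key] = max(table.get(key, -np.inf), episodic_return)
```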
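The stated clipping of the gradients of the two squared-error terms to [-1, 1] matches the Huber (smooth L1) trick used in standard DQN. A minimal PyTorch sketch of the resulting λ-regularized loss is given below; the tensor names and shapes are illustrative assumptions, and only λ = 0.1 and the two target terms S(s, a) and H(s, a) are taken from the text.

```python
import torch
import torch.nn.functional as F

LAMBDA = 0.1  # memory-regularization coefficient chosen in the paper

def emdqn_loss(q_values: torch.Tensor,
               actions: torch.Tensor,
               td_targets: torch.Tensor,
               memory_targets: torch.Tensor) -> torch.Tensor:
    """q_values: (B, num_actions); actions: (B,) long;
    td_targets: (B,) one-step targets S(s, a);
    memory_targets: (B,) episodic-memory returns H(s, a)."""
    q_sa = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    # Smooth L1 has a gradient bounded in [-1, 1], mirroring the stated
    # clipping of the squared-error gradients.
    td_loss = F.smooth_l1_loss(q_sa, td_targets)
    mem_loss = F.smooth_l1_loss(q_sa, memory_targets)
    return td_loss + LAMBDA * mem_loss
```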