Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update
Authors: Su Young Lee, Sungik Choi, Sae-Young Chung
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. Especially in 49 games of Atari 2600 domain, EBU achieves the same mean and median human normalized performance of DQN by using only 5% and 10% of samples, respectively. |
| Researcher Affiliation | Academia | Su Young Lee, Sungik Choi, Sae-Young Chung; School of Electrical Engineering, KAIST, Republic of Korea; {suyoung.l, si_choi, schung}@kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1: Episodic Backward Update (single episode, tabular). Algorithm 2: Episodic Backward Update. See the sketch after this table. |
| Open Source Code | Yes | The code is available at https://github.com/suyoung-lee/Episodic-Backward-Update |
| Open Datasets | Yes | We use the MNIST dataset [9] for the state representation. ... We use the same set of 49 Atari 2600 games, which was evaluated in the Nature DQN paper [14]. |
| Dataset Splits | No | The paper describes training procedures and evaluation metrics but does not explicitly provide specific train/validation/test dataset splits in terms of percentages or counts for reproducibility. |
| Hardware Specification | Yes | Training time refers to the total time required to train 49 games of 10M frames using a single NVIDIA TITAN Xp for a single random seed. |
| Software Dependencies | No | The paper mentions using deep neural networks and specific algorithms (DQN, Q-learning) but does not provide specific version numbers for software libraries or frameworks (e.g., TensorFlow, PyTorch). |
| Experiment Setup | Yes | The details of the hyperparameters and the network structure are described in Appendix D. ... We use a discount factor γ = 0.99, an Adam optimizer [9] with an initial learning rate of 0.00025, and an ϵ-greedy exploration with ϵ annealed from 1.0 to 0.1 over the first 1M frames and fixed to 0.1 thereafter. We use a replay memory size of 100,000 transitions, and train the network with a mini-batch size of 32. |
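
As a quick illustration of the Pseudocode and Experiment Setup rows above, here is a minimal Python sketch of the tabular backward-sweep idea (in the spirit of Algorithm 1, not the authors' exact listing). The episode buffer, the toy `env.reset()`/`env.step()` interface, and the learning rate `ALPHA` are illustrative assumptions; `GAMMA = 0.99` and the epsilon = 0.1 behaviour policy follow the hyperparameters quoted in the table.

```python
import random
from collections import defaultdict

GAMMA = 0.99   # discount factor, as quoted in the Experiment Setup row
ALPHA = 0.5    # assumed tabular learning rate; not specified in this summary

def backward_update(Q, episode, n_actions):
    """Apply one-step Q-learning targets in reverse time order, so the final
    reward of the episode propagates through the whole trajectory in one pass."""
    for state, action, reward, next_state, done in reversed(episode):
        if done:
            target = reward
        else:
            target = reward + GAMMA * max(Q[(next_state, a)] for a in range(n_actions))
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])

def run_episode(env, Q, n_actions, epsilon=0.1):
    """Collect one episode with an epsilon-greedy policy, then run the backward sweep.
    The env.reset()/env.step() interface used here is an assumption for illustration."""
    episode, state, done = [], env.reset(), False
    while not done:
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward, next_state, done))
        state = next_state
    backward_update(Q, episode, n_actions)

# Usage: Q = defaultdict(float); run_episode(my_env, Q, n_actions=4)
```

Sweeping the stored episode in reverse lets a reward observed at the end of the trajectory reach every earlier state-action pair in a single pass, which is the intuition behind the sample-efficiency gains the paper reports for EBU.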