Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update

Authors: Su Young Lee, Sungik Choi, Sae-Young Chung

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We theoretically prove the convergence of the EBU method and experimentally demonstrate its performance in both deterministic and stochastic environments. Especially in 49 games of Atari 2600 domain, EBU achieves the same mean and median human normalized performance of DQN by using only 5% and 10% of samples, respectively.
Researcher Affiliation | Academia | Su Young Lee, Sungik Choi, Sae-Young Chung. School of Electrical Engineering, KAIST, Republic of Korea. {suyoung.l, si_choi, schung}@kaist.ac.kr
Pseudocode | Yes (see the tabular sketch after this table) | Algorithm 1: Episodic Backward Update for Tabular Q-Learning (single episode, tabular). Algorithm 2: Episodic Backward Update.
Open Source Code | Yes | The code is available at https://github.com/suyoung-lee/Episodic-Backward-Update
Open Datasets | Yes | We use the MNIST dataset [9] for the state representation. ... We use the same set of 49 Atari 2600 games, which was evaluated in Nature DQN paper [14].
Dataset Splits | No | The paper describes training procedures and evaluation metrics but does not explicitly provide specific train/validation/test dataset splits in terms of percentages or counts for reproducibility.
Hardware Specification | Yes | Training time refers to the total time required to train 49 games of 10M frames using a single NVIDIA TITAN Xp for a single random seed.
Software Dependencies | No | The paper mentions using deep neural networks and specific algorithms (DQN, Q-learning) but does not provide specific version numbers for software libraries or frameworks (e.g., TensorFlow, PyTorch).
Experiment Setup | Yes (see the configuration sketch after this table) | The details of the hyperparameters and the network structure are described in Appendix D. ... We use a discount factor γ = 0.99, an Adam optimizer [9] with an initial learning rate of 0.00025, and an ϵ-greedy exploration with ϵ annealed from 1.0 to 0.1 over the first 1M frames and fixed to 0.1 thereafter. We use a replay memory size of 100,000 transitions, and train the network with a mini-batch size of 32.
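
For reference, the backward sweep described by the paper's tabular Algorithm 1 can be summarized in a short sketch. The snippet below is a minimal, hedged reconstruction rather than the authors' implementation: the function name episodic_backward_update_tabular, the (state, action, reward, next_state, done) tuple layout, and the default learning rate alpha=0.5 are illustrative choices not taken from the paper or its repository.

```python
from collections import defaultdict

def episodic_backward_update_tabular(Q, episode, n_actions, alpha=0.5, gamma=0.99):
    """One backward sweep of tabular Episodic Backward Update over a single episode.

    Replaying transitions from the last step to the first lets a reward observed
    at the end of the episode propagate through every earlier state-action pair
    in a single pass, instead of waiting for many uniformly sampled updates.
    """
    for state, action, reward, next_state, done in reversed(episode):
        # Terminal transitions bootstrap from the reward alone.
        if done:
            target = reward
        else:
            target = reward + gamma * max(Q[(next_state, a)] for a in range(n_actions))
        # Standard Q-learning update, applied in backward (end-to-start) order.
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
    return Q

# Usage sketch: a toy two-step episode ending with reward 1.
Q = defaultdict(float)
episode = [
    ("s0", 0, 0.0, "s1", False),  # (state, action, reward, next_state, done)
    ("s1", 1, 1.0, None, True),
]
Q = episodic_backward_update_tabular(Q, episode, n_actions=2)
```

Because the terminal reward is written into Q(s1, 1) before Q(s0, 0) is updated, the earlier state already sees a non-zero bootstrap target within the same sweep; this single-pass propagation is the core idea behind EBU's sample efficiency.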
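
The Experiment Setup row quotes the key Atari hyperparameters; gathered into one place they look like the configuration sketch below. The dictionary keys and the epsilon_at helper are assumed names used only for illustration, and the linear shape of the annealing schedule is an assumption consistent with the standard DQN setup (the paper only states that ϵ is annealed from 1.0 to 0.1 over the first 1M frames and fixed thereafter).

```python
# Illustrative collection of the hyperparameters quoted above; key names are
# assumptions, not identifiers from the authors' released code.
ATARI_HYPERPARAMS = {
    "discount_gamma": 0.99,
    "optimizer": "Adam",
    "learning_rate": 0.00025,
    "epsilon_start": 1.0,
    "epsilon_final": 0.1,
    "epsilon_anneal_frames": 1_000_000,
    "replay_memory_size": 100_000,
    "minibatch_size": 32,
}

def epsilon_at(frame: int,
               start: float = ATARI_HYPERPARAMS["epsilon_start"],
               final: float = ATARI_HYPERPARAMS["epsilon_final"],
               anneal_frames: int = ATARI_HYPERPARAMS["epsilon_anneal_frames"]) -> float:
    """Anneal epsilon from `start` to `final` over the first `anneal_frames`
    frames (assumed linear), then keep it fixed at `final`."""
    if frame >= anneal_frames:
        return final
    return start + (final - start) * (frame / anneal_frames)
```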