Fast deep reinforcement learning using online adjustments from the past

Authors: Steven Hansen, Alexander Pritzel, Pablo Sprechmann, Andre Barreto, Charles Blundell

NeurIPS 2018

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We show that EVA is performant on a demonstration task and Atari games. ... 5 Experiments |
| Researcher Affiliation | Industry | {stevenhansen,psprechmann,apritzel,andrebarreto,cblundell}@google.com |
| Pseudocode | Yes | Algorithm 1: Ephemeral Value Adjustments |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology. |
| Open Datasets | Yes | We begin the experimental section by showing how EVA works on a simple gridworld environment implemented with the pycolab game engine [Stepleton, 2017]. ... In order to validate whether EVA leads to gains in complex domains we evaluated our approach on the Atari Learning Environment (ALE; Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes using a replay buffer for training and tuning hyper-parameters on a subset of 5 Atari games, but it does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts) for its environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions the pycolab game engine [Stepleton, 2017] and the Atari Learning Environment (ALE; Bellemare et al., 2013), but does not provide specific software dependency versions (e.g., Python or library versions such as TensorFlow or PyTorch). |
| Experiment Setup | Yes | Periodically (every 20 steps in all the reported experiments), the k nearest neighbours in the global buffer are queried from the current state embedding (on the basis of their ℓ2 distance). Using the stored trajectory information, the 50 subsequent steps are also retrieved for each neighbour. ... The hyper-parameters shared between the baseline and EVA (e.g. learning rate) were chosen to maximise the performance of the baseline (λ = 0) on a run over 20M frames on the selected subset of games. ... Performance saturates around λ = 0.4 as in the simple example. We chose the lowest frequency that would not harm performance (20 steps), the rollout length was set to 50, and the number of neighbours used for estimating Q_NP was set to 5. |
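To make the Experiment Setup row above concrete, below is a minimal Python sketch of the retrieval-and-mixing procedure it describes: the current state embedding is matched against stored embeddings by ℓ2 distance, the 50 steps following each of the k = 5 neighbours are used to form a non-parametric estimate Q_NP, and behaviour is derived from the mixture λ·Q_NP + (1 − λ)·Q_θ with λ ≈ 0.4. Every name and interface here (`knn_indices`, `n_step_return`, `non_parametric_q`, `eva_q`, the flat buffer arrays) is an illustrative assumption rather than the authors' implementation, and a plain n-step return is substituted for the paper's trajectory-centric planning operator.

```python
import numpy as np

# Illustrative sketch only. Hyper-parameter values follow the quoted setup
# (k = 5 neighbours, 50-step rollouts, lambda = 0.4); everything else is an
# assumption, not the authors' code.

def knn_indices(query_emb, stored_embs, k=5):
    """Indices of the k nearest stored state embeddings by L2 distance."""
    dists = np.linalg.norm(stored_embs - query_emb, axis=1)
    return np.argsort(dists)[:k]

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step return over a retrieved trajectory slice, bootstrapped
    with a parametric value estimate at the end (a simplification of the
    paper's trajectory-centric planning)."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def non_parametric_q(query_emb, embs, rewards, values, k=5, rollout_len=50, gamma=0.99):
    """Average the returns of the 50-step rollouts that follow the k nearest
    neighbours of the current embedding. `embs`, `rewards`, and `values` are
    assumed to be aligned per-step arrays from a flat replay buffer, with
    `values[t]` a parametric state-value estimate used to bootstrap."""
    estimates = []
    for idx in knn_indices(query_emb, embs, k):
        end = min(idx + rollout_len, len(rewards) - 1)
        estimates.append(n_step_return(rewards[idx:end], values[end], gamma))
    return float(np.mean(estimates))

def eva_q(q_theta, q_np, lam=0.4):
    """Ephemeral value adjustment at action-selection time:
    Q(s, a) = lam * Q_NP(s, a) + (1 - lam) * Q_theta(s, a)."""
    return lam * q_np + (1.0 - lam) * q_theta

# Example usage with random data standing in for a replay buffer:
rng = np.random.default_rng(0)
embs = rng.normal(size=(1000, 32))   # stored state embeddings
rewards = rng.normal(size=1000)      # per-step rewards
values = rng.normal(size=1000)       # parametric value estimates per state
q_np = non_parametric_q(embs[123], embs, rewards, values)
q_mixed = eva_q(q_theta=0.7, q_np=q_np)
```

In the quoted setup these non-parametric adjustments are refreshed only every 20 steps rather than at every action, the lowest frequency the authors found not to harm performance.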