Fast deep reinforcement learning using online adjustments from the past
Authors: Steven Hansen, Alexander Pritzel, Pablo Sprechmann, Andre Barreto, Charles Blundell
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that EVA is performant on a demonstration task and Atari games. ... 5 Experiments |
| Researcher Affiliation | Industry | {stevenhansen,psprechmann,apritzel,andrebarreto,cblundell}@google.com |
| Pseudocode | Yes | Algorithm 1: Ephemeral Value Adjustments |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology. |
| Open Datasets | Yes | We begin the experimental section by showing how EVA works on a simple gridworld environment implemented with the pycolab game engine [Stepleton, 2017]. ... In order to validate whether EVA leads to gains in complex domains we evaluated our approach on the Atari Learning Environment (ALE; Bellemare et al., 2013). |
| Dataset Splits | No | The paper describes using a replay buffer for training and tuning hyperparameters on a subset of 5 Atari games, but it does not specify explicit train/validation/test dataset splits (e.g., percentages or sample counts) for its environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions the use of the 'pycolab game engine [Stepleton, 2017]' and the 'Atari Learning Environment (ALE; Bellemare et al., 2013)', but does not provide specific software dependency versions (e.g., Python, or library versions such as TensorFlow or PyTorch). |
| Experiment Setup | Yes | Periodically (every 20 steps in all the reported experiments), the k nearest neighbours in the global buffer are queried from the current state embedding (on the basis of their ℓ2 distance). Using the stored trajectory information, the 50 subsequent steps are also retrieved for each neighbour. ... The hyper-parameters shared between the baseline and EVA (e.g. learning rate) were chosen to maximise the performance of the baseline (λ = 0) on a run over 20M frames on the selected subset of games. ... Performance saturates around λ = 0.4 as in the simple example. We chose the lowest frequency that would not harm performance (20 steps), the rollout length was set to 50 and the number of neighbours used for estimating Q_NP was set to 5. |
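
To make the mechanism quoted in the Experiment Setup row concrete, below is a minimal Python sketch of the periodic k-nearest-neighbour lookup and the λ-mixing of the parametric estimate Q_θ with the non-parametric estimate Q_NP. The helper names (`nearest_neighbours`, `rollout_value`, `q_nonparametric`) and the simplified n-step backup over retrieved rollouts are illustrative assumptions, not the paper's exact Algorithm 1, which uses a trajectory-centric backup over the stored trajectories; the constants (period 20, rollout length 50, k = 5, λ ≈ 0.4) come from the quotes above, while the discount factor is assumed.

```python
import numpy as np

# Sketch of EVA's periodic neighbour lookup and value mixing, assuming a flat
# NumPy buffer of state embeddings and per-state stored rollouts. Simplified
# stand-in for the paper's trajectory-centric backup, not the authors' code.

GAMMA = 0.99          # discount factor (assumed; not stated in this section)
PERIOD = 20           # neighbours are re-queried every 20 steps
ROLLOUT_LEN = 50      # 50 subsequent steps retrieved per neighbour
K_NEIGHBOURS = 5      # number of neighbours used to estimate Q_NP
LAMBDA = 0.4          # mixing coefficient; performance saturates around 0.4


def nearest_neighbours(query_emb, buffer_embs, k=K_NEIGHBOURS):
    """Return indices of the k nearest stored embeddings under l2 distance."""
    dists = np.linalg.norm(buffer_embs - query_emb, axis=1)
    return np.argsort(dists)[:k]


def rollout_value(rewards, bootstrap_q, gamma=GAMMA):
    """Simplified n-step return along one stored trajectory snippet."""
    ret = bootstrap_q
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def q_nonparametric(query_emb, buffer_embs, buffer_rollouts, q_theta_last):
    """Average trace-based value estimate over the retrieved neighbours.

    buffer_rollouts[i] holds the rewards of the steps following state i;
    q_theta_last[i] is the parametric value used to bootstrap at the end.
    """
    idx = nearest_neighbours(query_emb, buffer_embs)
    values = [rollout_value(buffer_rollouts[i][:ROLLOUT_LEN], q_theta_last[i])
              for i in idx]
    return float(np.mean(values))


def q_eva(q_theta, q_np, lam=LAMBDA):
    """Ephemeral value adjustment: mix parametric and non-parametric estimates.

    lam = 0 recovers the parametric baseline, matching the paper's tuning
    setup in which shared hyper-parameters were chosen at lambda = 0.
    """
    return lam * q_np + (1.0 - lam) * q_theta
```

In this reading, acting uses `q_eva` while the neighbour set is refreshed only every `PERIOD` steps, which keeps the non-parametric lookup cheap relative to querying the buffer at every step.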