Value-driven Hindsight Modelling
Authors: Arthur Guez, Fabio Viola, Theophane Weber, Lars Buesing, Steven Kapturowski, Doina Precup, David Silver, Nicolas Heess
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Experiments): "The illustrative example in Sec. 3.2 demonstrated the positive effect of hindsight modelling in a simple policy evaluation setting. We now explore these benefits in the context of policy optimization in challenging domains: a custom navigation task called Portal Choice, and Atari 2600." |
| Researcher Affiliation | Industry | DeepMind, aguez@google.com |
| Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access information to source code, such as repository links or explicit statements about code availability. |
| Open Datasets | Yes | We tested our approach in Atari 2600 videogames using the Arcade Learning Environment [2]. |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits, such as percentages or sample counts, nor does it refer to pre-defined splits for reproducibility beyond stating the use of Atari games for evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU or CPU models, memory, or detailed computer specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using IMPALA and R2D2, which are frameworks, but it does not specify software versions for any libraries, programming languages, or other ancillary software components used in the experiments. |
| Experiment Setup | Yes | We ran HiMo on 57 Atari games for 200k gradient steps (around 1 day of training), with 3 seeds for each game. The evaluation averages the score between 200 episodes across seeds, each lasting a maximum of 30 minutes and starting with a random number (up to 30) of no-op actions... The different losses in the HiMo architecture are combined in the following way: $L(\theta, \eta) = L_v(\eta) + \alpha L_{v^+}(\theta) + \beta L_{\text{model}}(\eta)$. ... $U_t = g\left(\sum_{m=0}^{n-1} \gamma^m R_{t+m} + \gamma^n g^{-1}\big(q_m(S_{t+n}, A^*; \eta^-)\big)\right)$, where $g$ is an invertible function, $\eta^-$ are the periodically updated target network parameters (as in DQN [12]), and $A^* = \arg\max_a q_m(S_{t+n}, a; \eta)$ (the Double DQN update [22]). |
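The rescaled n-step return target and the combined loss quoted in the Experiment Setup row can be made concrete with a short sketch. The NumPy code below is a minimal illustration under stated assumptions, not the authors' implementation: the excerpt does not give the exact form of the invertible function $g$, so the value-rescaling transform used by R2D2 is assumed, and the names `n_step_target`, `combined_loss`, and the constant `EPS` are illustrative.

```python
# Minimal sketch of the quoted return target and loss combination.
# Assumption: g is the R2D2-style rescaling sign(x)(sqrt(|x|+1)-1) + eps*x;
# the paper excerpt only states that g is invertible.
import numpy as np

EPS = 1e-3  # assumed rescaling constant


def g(x):
    """Assumed invertible rescaling g(x) = sign(x)(sqrt(|x|+1)-1) + eps*x."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + EPS * x


def g_inv(x):
    """Closed-form inverse of the assumed g."""
    return np.sign(x) * (
        ((np.sqrt(1.0 + 4.0 * EPS * (np.abs(x) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2
        - 1.0
    )


def n_step_target(rewards, gamma, q_online_tpn, q_target_tpn):
    """U_t = g( sum_{m=0}^{n-1} gamma^m R_{t+m} + gamma^n g^{-1}(q_m(S_{t+n}, A*; eta^-)) ).

    rewards:      [R_t, ..., R_{t+n-1}]
    q_online_tpn: q-values at S_{t+n} under online parameters eta (selects A*)
    q_target_tpn: q-values at S_{t+n} under target parameters eta^- (evaluates A*)
    """
    n = len(rewards)
    discounted = sum(gamma ** m * r for m, r in enumerate(rewards))
    a_star = int(np.argmax(q_online_tpn))      # Double DQN action selection (online net)
    bootstrap = g_inv(q_target_tpn[a_star])    # evaluation with the target network
    return g(discounted + gamma ** n * bootstrap)


def combined_loss(l_v, l_v_plus, l_model, alpha, beta):
    """L = L_v + alpha * L_{v+} + beta * L_model, as quoted in the setup row."""
    return l_v + alpha * l_v_plus + beta * l_model
```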