Value-driven Hindsight Modelling

Authors: Arthur Guez, Fabio Viola, Theophane Weber, Lars Buesing, Steven Kapturowski, Doina Precup, David Silver, Nicolas Heess

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (6 experiments) | "The illustrative example in Sec. 3.2 demonstrated the positive effect of hindsight modelling in a simple policy evaluation setting. We now explore these benefits in the context of policy optimization in challenging domains: a custom navigation task called Portal Choice, and Atari 2600."
Researcher Affiliation | Industry | "DeepMind (aguez@google.com)"
Pseudocode | No | The paper does not contain any sections explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present structured algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access information to source code, such as repository links or explicit statements about code availability.
Open Datasets | Yes | "We tested our approach in Atari 2600 videogames using the Arcade Learning Environment [2]." (See the environment sketch after this table.)
Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits, such as percentages or sample counts, nor does it refer to pre-defined splits for reproducibility beyond stating the use of Atari games for evaluation.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU or CPU models, memory, or detailed computer specifications used for running its experiments.
Software Dependencies | No | The paper mentions using IMPALA and R2D2, which are frameworks, but it does not specify software versions for any libraries, programming languages, or other ancillary software components used in the experiments.
Experiment Setup | Yes | "We ran HiMo on 57 Atari games for 200k gradient steps (around 1 day of training), with 3 seeds for each game. The evaluation averages the score between 200 episodes across seeds, each lasting a maximum of 30 minutes and starting with a random number (up to 30) of no-op actions... The different losses in the HiMo architecture are combined in the following way: $L(\theta, \eta) = L_v(\eta) + \alpha L_{v^+}(\theta) + \beta L_{\mathrm{model}}(\eta)$. ... $U_t = g\left(\sum_{m=0}^{n-1} \gamma^m R_{t+m} + \gamma^n g^{-1}\big(q_m(S_{t+n}, A^*; \eta^-)\big)\right)$, where $g$ is an invertible function, $\eta^-$ are the periodically updated target network parameters (as in DQN [12]), and $A^* = \arg\max_a q_m(S_{t+n}, a; \eta)$ (the Double DQN update [22])." (See the target-computation sketch after this table.)
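
Since the "Open Datasets" row reports that the experiments use Atari 2600 games via the Arcade Learning Environment, below is a minimal sketch of how such an environment could be instantiated for a comparable evaluation, including the random no-op starts quoted in the "Experiment Setup" row. The Gymnasium/ale-py interface, the choice of Pong, and the `reset_with_noops` helper are assumptions for illustration only; the paper does not describe its environment stack.

```python
# Minimal sketch (assumption): an Atari 2600 game through the Arcade Learning
# Environment via Gymnasium + ale-py. Game choice and wrapper-free setup are
# illustrative; only the "up to 30 no-op actions" detail comes from the paper.
import gymnasium as gym
import ale_py
import numpy as np

gym.register_envs(ale_py)        # registers the ALE/* ids (needed on gymnasium >= 1.0)
env = gym.make("ALE/Pong-v5")    # hypothetical game choice; frameskip defaults to 4

def reset_with_noops(env, rng, max_noops=30):
    """Reset, then take a random number (up to max_noops) of no-op actions,
    mirroring the evaluation protocol quoted in the table above."""
    obs, info = env.reset()
    for _ in range(int(rng.integers(0, max_noops + 1))):
        obs, reward, terminated, truncated, info = env.step(0)  # action 0 is NOOP
        if terminated or truncated:
            obs, info = env.reset()
    return obs

obs = reset_with_noops(env, np.random.default_rng(0))
print(obs.shape)  # raw RGB frames, (210, 160, 3), before any preprocessing
```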
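
The "Experiment Setup" row quotes the combined HiMo loss and the transformed n-step return target $U_t$. The sketch below shows one way these quantities could be computed, assuming the common R2D2-style rescaling $g(x) = \mathrm{sign}(x)(\sqrt{|x|+1} - 1) + \varepsilon x$; the paper only states that $g$ is invertible, so this choice of $g$, the default coefficients, and all function names here are assumptions.

```python
# Sketch of the rescaled n-step Double-DQN target quoted above:
#   U_t = g( sum_{m=0}^{n-1} gamma^m R_{t+m} + gamma^n g^{-1}(q_m(S_{t+n}, A*; eta^-)) )
# with A* = argmax_a q_m(S_{t+n}, a; eta).
# The choice of g below follows R2D2 and is an assumption, not taken from the paper.
import numpy as np

EPS = 1e-3

def g(x, eps=EPS):
    """Invertible value rescaling (assumed R2D2-style choice of g)."""
    return np.sign(x) * (np.sqrt(np.abs(x) + 1.0) - 1.0) + eps * x

def g_inv(y, eps=EPS):
    """Closed-form inverse of g above."""
    return np.sign(y) * ((((np.sqrt(1.0 + 4.0 * eps * (np.abs(y) + 1.0 + eps)) - 1.0)
                           / (2.0 * eps)) ** 2) - 1.0)

def n_step_target(rewards, q_online_next, q_target_next, gamma):
    """U_t for an n-step transition.

    rewards:        [R_t, ..., R_{t+n-1}]
    q_online_next:  q_m(S_{t+n}, ., eta)    (online network, rescaled values)
    q_target_next:  q_m(S_{t+n}, ., eta^-)  (target network, rescaled values)
    """
    n = len(rewards)
    a_star = int(np.argmax(q_online_next))     # Double DQN: select A* with the online net
    bootstrap = g_inv(q_target_next[a_star])   # evaluate with target net, undo rescaling
    ret = sum(gamma ** m * rewards[m] for m in range(n)) + gamma ** n * bootstrap
    return g(ret)                              # re-apply rescaling -> U_t

def himo_loss(L_v, L_v_plus, L_model, alpha=1.0, beta=1.0):
    """Combined loss L = L_v + alpha * L_v+ + beta * L_model.
    alpha and beta are illustrative defaults, not values from the paper."""
    return L_v + alpha * L_v_plus + beta * L_model

# Example with dummy numbers: a 5-step transition in a 6-action game.
u_t = n_step_target(rewards=np.array([0.0, 1.0, 0.0, 0.0, 1.0]),
                    q_online_next=np.array([0.1, 0.4, 0.2, 0.0, 0.3, 0.1]),
                    q_target_next=np.array([0.2, 0.35, 0.25, 0.1, 0.3, 0.15]),
                    gamma=0.99)
print(float(u_t))
```

Note that $g^{-1}$ is applied to the bootstrapped target-network value so that rewards and the bootstrap are summed in the unrescaled return space, with $g$ re-applied only at the end to form $U_t$.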