Fast Task Inference with Variational Intrinsic Successor Features

Authors: Steven Hansen, Will Dabney, Andre Barreto, David Warde-Farley, Tom Van de Wiele, Volodymyr Mnih

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate VISR on the full Atari suite, in a novel setup wherein the rewards are only exposed briefly after a long unsupervised phase. Achieving human-level performance on 12 games and beating all baselines, we believe VISR represents a step towards agents that rapidly learn from limited feedback. Our experiments are divided in four groups corresponding to Sections 6.1 to 6.4.
Researcher Affiliation | Industry | Steven Hansen, DeepMind, stevenhansen@google.com; Will Dabney, DeepMind, wdabney@google.com; André Barreto, DeepMind, andrebarreto@google.com; David Warde-Farley, DeepMind, dwf@google.com; Tom Van de Wiele, DeepMind (former), tvdwiele@gmail.com; Volodymyr Mnih, DeepMind, vmnih@google.com
Pseudocode | Yes | Algorithm 1: Training VISR (an illustrative sketch of its core quantities appears after the table).
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | To evaluate VISR, we impose a two-phase setup on the full suite of 57 Atari games (Bellemare et al., 2013).
Dataset Splits | No | The paper describes an 'unsupervised training phase' and a 'short test phase', specifying their durations in steps. However, it does not define a distinct validation split or give percentages/counts for a traditional training/validation/test partition; the 'data' is instead generated through interaction with the RL environment. (A sketch of the test-phase task inference appears after the table.)
Hardware Specification | No | The paper mentions a 'distributed reinforcement learning setup' with '100 separate actors' but does not specify any particular CPU, GPU, or TPU models, or detailed cloud instance types used for these actors or the centralized learner.
Software Dependencies | No | The paper mentions the Adam optimizer (Kingma and Ba, 2014) but does not provide specific version numbers for any software, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | The Adam optimizer (Kingma and Ba, 2014) was used with a learning rate of 10⁻⁴ and an ϵ of 10⁻³, as in Kapturowski et al. (2018). The dimensionality of the task vectors was swept over (values between 2 and 50 were considered), with 5 eventually chosen. The discount factor γ was 0.99 and a standard batch size of 32 was used. A constant ϵ-greedy action-selection strategy with ϵ = 0.05 was used for both training and testing. Frames are scaled to 84 x 84, normalized, and the most recent 4 frames are stacked. At the beginning of each episode, between 1 and 30 no-ops are executed to provide a source of stochasticity. A 5-minute time limit is imposed on both training and testing episodes. (These values are collected into an illustrative config sketch after the table.)
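
The Pseudocode entry above refers to Algorithm 1 (Training VISR). As a minimal, hedged sketch of two quantities that algorithm revolves around, the snippet below samples a task vector w uniformly from the unit hypersphere and computes the intrinsic reward, which VISR parameterizes as the inner product of a unit-norm feature φ(s) with w. The feature network itself, and all names here, are illustrative assumptions rather than the authors' code.

```python
# Illustrative sketch (not the authors' implementation) of two ingredients of
# Algorithm 1: sampling task vectors on the unit hypersphere and computing the
# intrinsic reward log q(w|s), parameterized as phi(s)^T w with unit-norm phi(s).
import numpy as np

def sample_task_vector(dim: int = 5, rng=np.random) -> np.ndarray:
    """Sample w uniformly on the unit hypersphere; dim=5 matches the value chosen in the paper's sweep."""
    w = rng.normal(size=dim)
    return w / np.linalg.norm(w)

def intrinsic_reward(phi_s: np.ndarray, w: np.ndarray) -> float:
    """Intrinsic reward: the inner product phi(s)^T w, with phi(s) assumed L2-normalized."""
    return float(np.dot(phi_s, w))
```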
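The two-phase setup noted under Dataset Splits ends with a short test phase in which the agent must infer the task from a small number of reward observations. A hedged sketch of that fast-inference step, assuming the reward parameterization above and plain least squares over observed (feature, reward) pairs, could look as follows; variable names are assumptions.

```python
# Illustrative sketch of test-phase task inference: solve r ~ phi(s)^T w for w
# from a handful of observed rewards, then project back to the unit sphere.
import numpy as np

def infer_task_vector(features: np.ndarray, rewards: np.ndarray) -> np.ndarray:
    """features: (T, d) array of phi(s_t); rewards: (T,) observed extrinsic rewards."""
    w, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return w / (np.linalg.norm(w) + 1e-8)
```

The inferred w would then condition the successor features, with actions chosen greedily with respect to ψ(s, a, w)·w.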
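Finally, the Experiment Setup entry lists the reported hyperparameters. Collecting them into a single configuration dictionary (the layout and key names are assumptions made here for readability, not the authors' code) gives:

```python
# Reported VISR hyperparameters, gathered from the Experiment Setup entry above.
VISR_CONFIG = {
    "optimizer": "Adam",              # Kingma and Ba (2014)
    "learning_rate": 1e-4,            # as in Kapturowski et al. (2018)
    "adam_epsilon": 1e-3,
    "task_vector_dim": 5,             # chosen after sweeping values between 2 and 50
    "discount_gamma": 0.99,
    "batch_size": 32,
    "epsilon_greedy": 0.05,           # constant, for both training and testing
    "frame_size": (84, 84),           # frames scaled and normalized
    "frame_stack": 4,                 # most recent 4 frames stacked
    "no_op_range": (1, 30),           # random no-ops at episode start
    "episode_time_limit_minutes": 5,  # for both training and testing episodes
}
```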