Universal Successor Features Approximators

Authors: Diana Borsa, Andre Barreto, John Quan, Daniel J. Mankowitz, Hado van Hasselt, Remi Munos, David Silver, Tom Schaul

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we describe the experiments conducted to test the proposed architecture in a multitask setting and assess its ability to generalise to unseen tasks.
Researcher Affiliation | Industry | DeepMind, London, UK. {borsa,andrebarreto,johnquan,dmankowitz,munos,hado,davidsilver,schaul}@google.com
Pseudocode | Yes | Algorithm 1: Learn USFA with ϵ-greedy Q-learning (a minimal sketch of this loop is given after the table).
Open Source Code | No | The paper provides links to videos of USFAs in action but does not explicitly state that the source code for the methodology is openly available, nor does it provide a repository link.
Open Datasets | Yes | We used the DeepMind Lab platform to design a 3D environment consisting of one large room containing four types of objects: TVs, balls, hats, and balloons (Beattie et al., 2016; Barreto et al., 2018).
Dataset Splits | No | The paper defines training and test task sets (M and M') and notes that it evaluates on unseen tasks, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts for the underlying observation data.
Hardware Specification | No | The paper mentions using a distributed architecture (IMPALA) for data collection and processing, implying multiple machines, but it does not specify particular hardware components such as GPU or CPU models or memory details.
Software Dependencies | No | The paper mentions using the DeepMind Lab platform and the Q(λ) algorithm, but it does not specify versions for any key software components or libraries used in the implementation.
Experiment Setup | Yes | We trained the above architecture end-to-end using a variation of Alg. 1 that uses Watkins's (1989) Q(λ) to apply Q-learning with eligibility traces. As for the distribution D_z used in line 5 of Alg. 1, we adopted a Gaussian centred at w: z ∼ N(w, 0.1·I), where I is the identity matrix. ... For all agents we used λ = 0.9. ... For the distributed collection of data we used 50 actors per task. Each actor gathered trajectories of length 32 that were then added to the common queue. The collection of data followed an ϵ-greedy policy with a fixed ϵ = 0.1. ... Evaluations are done with a small ϵ = 0.001, following a GPI policy with different instantiations of C. (The second sketch after the table collects these settings.)
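
Algorithm 1 is only named in the table above. As a reading aid, here is a minimal Python sketch of an ϵ-greedy Q-learning loop for a USFA as described in the paper: sample a training task w, sample policy embeddings z from D_z, behave ϵ-greedily with respect to the GPI value ψ(s, a, z)ᵀw, and regress ψ toward a TD target built from the cumulants φ. All names (env, psi, update_psi, phi, sample_task, sample_z) are placeholders, and the one-step update stands in for the Q(λ) variant actually used; this is not the authors' code.

```python
import numpy as np

def gpi_q_values(psi, s, zs, w):
    # psi(s, z) is assumed to return an (n_actions, d) array of successor features.
    q_per_z = np.stack([psi(s, z) @ w for z in zs])  # shape (len(zs), n_actions)
    return q_per_z.max(axis=0)                       # GPI values, shape (n_actions,)

def train_usfa(env, psi, update_psi, phi, sample_task, sample_z,
               n_episodes=1000, n_z=5, gamma=0.99, epsilon=0.1):
    """Sketch of the USFA learning loop (one-step TD; the paper uses Q(lambda))."""
    for _ in range(n_episodes):
        w = sample_task()                        # training task vector
        zs = [sample_z(w) for _ in range(n_z)]   # policy embeddings drawn from D_z
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour w.r.t. the GPI value for the current task w
            q = gpi_q_values(psi, s, zs, w)
            a = env.sample_action() if np.random.rand() < epsilon else int(q.argmax())
            s_next, reward, done = env.step(a)
            for z in zs:
                # bootstrap with the action that is greedy for policy pi_z (w.r.t. z itself)
                a_next = int((psi(s_next, z) @ z).argmax())
                continuation = 0.0 if done else 1.0
                target = phi(s, a, s_next) + gamma * continuation * psi(s_next, z)[a_next]
                update_psi(s, a, z, target)      # regress psi(s, a, z) toward the TD target
            s = s_next
```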
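
For the Experiment Setup row, the concrete choices quoted from the paper (z ∼ N(w, 0.1·I), 50 actors per task, unrolls of length 32, behaviour ϵ = 0.1, λ = 0.9, evaluation ϵ = 0.001 with GPI over a set C) could be captured as follows. This is an illustrative sketch with assumed names, not the paper's implementation.

```python
import numpy as np

_rng = np.random.default_rng()

# D_z from line 5 of Alg. 1: a Gaussian centred on the task vector w, z ~ N(w, 0.1 * I)
def sample_z(w, sigma=0.1, rng=_rng):
    return w + sigma * rng.standard_normal(w.shape)

# Data-collection settings quoted above (per training task)
COLLECTION = dict(actors_per_task=50, unroll_length=32, behaviour_epsilon=0.1, td_lambda=0.9)

# Evaluation: near-greedy (epsilon = 0.001) GPI policy over a candidate set C of embeddings
def gpi_eval_action(psi, s, C, w, epsilon=0.001, rng=_rng):
    q = np.stack([psi(s, z) @ w for z in C]).max(axis=0)  # GPI action values for task w
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(q.argmax())
```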