Universal Successor Features Approximators

Authors: Diana Borsa, Andre Barreto, John Quan, Daniel J. Mankowitz, Hado van Hasselt, Remi Munos, David Silver, Tom Schaul

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we describe the experiments conducted to test the proposed architecture in a multitask setting and assess its ability to generalise to unseen tasks.
Researcher Affiliation | Industry | DeepMind, London, UK. {borsa,andrebarreto,johnquan,dmankowitz,munos,hado,davidsilver,schaul}@google.com
Pseudocode | Yes | Algorithm 1: Learn USFA with ϵ-greedy Q-learning (a minimal sketch of this loop is given after the table).
Open Source Code | No | The paper provides links to videos of USFAs in action but does not explicitly state that the source code for the methodology is openly available, nor does it provide a repository link.
Open Datasets | Yes | We used the DeepMind Lab platform to design a 3D environment consisting of one large room containing four types of objects: TVs, balls, hats, and balloons (Beattie et al., 2016; Barreto et al., 2018).
Dataset Splits | No | The paper defines training and test task sets (M and M') and notes that it evaluates on unseen tasks, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts for the underlying observation data.
Hardware Specification | No | The paper mentions using a distributed architecture (IMPALA) for data collection and processing, implying multiple machines, but it does not specify particular hardware components such as GPU or CPU models or memory details.
Software Dependencies | No | The paper mentions using the DeepMind Lab platform and the Q(λ) algorithm, but it does not specify versions for any key software components or libraries used in the implementation.
Experiment Setup | Yes | We trained the above architecture end-to-end using a variation of Alg. 1 that uses Watkins's (1989) Q(λ) to apply Q-learning with eligibility traces. As for the distribution D_z used in line 5 of Alg. 1, we adopted a Gaussian centred at w: z ∼ N(w, 0.1·I), where I is the identity matrix. ... For all agents we used λ = 0.9. ... For the distributed collection of data we used 50 actors per task. Each actor gathered trajectories of length 32 that were then added to the common queue. The collection of data followed an ϵ-greedy policy with a fixed ϵ = 0.1. ... Evaluations are done with a small ϵ = 0.001, following a GPI policy with different instantiations of C. (The second sketch after the table collects these settings.)
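
Algorithm 1 is only named in the table above. As a reading aid, here is a minimal Python sketch of an ϵ-greedy Q-learning loop for a USFA as described in the paper: sample a training task w, sample policy embeddings z from D_z, behave ϵ-greedily with respect to the GPI value ψ(s, a, z)ᵀw, and regress ψ toward a TD target built from the cumulants φ. All names (env, psi, update_psi, phi, sample_task, sample_z) are placeholders, and the one-step update stands in for the Q(λ) variant actually used; this is not the authors' code.

```python
import numpy as np

def gpi_q_values(psi, s, zs, w):
    # psi(s, z) is assumed to return an (n_actions, d) array of successor features.
    q_per_z = np.stack([psi(s, z) @ w for z in zs])  # shape (len(zs), n_actions)
    return q_per_z.max(axis=0)                       # GPI values, shape (n_actions,)

def train_usfa(env, psi, update_psi, phi, sample_task, sample_z,
               n_episodes=1000, n_z=5, gamma=0.99, epsilon=0.1):
    """Sketch of the USFA learning loop (one-step TD; the paper uses Q(lambda))."""
    for _ in range(n_episodes):
        w = sample_task()                        # training task vector
        zs = [sample_z(w) for _ in range(n_z)]   # policy embeddings drawn from D_z
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour w.r.t. the GPI value for the current task w
            q = gpi_q_values(psi, s, zs, w)
            a = env.sample_action() if np.random.rand() < epsilon else int(q.argmax())
            s_next, reward, done = env.step(a)
            for z in zs:
                # bootstrap with the action that is greedy for policy pi_z (w.r.t. z itself)
                a_next = int((psi(s_next, z) @ z).argmax())
                continuation = 0.0 if done else 1.0
                target = phi(s, a, s_next) + gamma * continuation * psi(s_next, z)[a_next]
                update_psi(s, a, z, target)      # regress psi(s, a, z) toward the TD target
            s = s_next
```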
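
For the Experiment Setup row, the concrete choices quoted from the paper (z ∼ N(w, 0.1·I), 50 actors per task, unrolls of length 32, behaviour ϵ = 0.1, λ = 0.9, evaluation ϵ = 0.001 with GPI over a set C) could be captured as follows. This is an illustrative sketch with assumed names, not the paper's implementation.

```python
import numpy as np

_rng = np.random.default_rng()

# D_z from line 5 of Alg. 1: a Gaussian centred on the task vector w, z ~ N(w, 0.1 * I)
def sample_z(w, sigma=0.1, rng=_rng):
    return w + sigma * rng.standard_normal(w.shape)

# Data-collection settings quoted above (per training task)
COLLECTION = dict(actors_per_task=50, unroll_length=32, behaviour_epsilon=0.1, td_lambda=0.9)

# Evaluation: near-greedy (epsilon = 0.001) GPI policy over a candidate set C of embeddings
def gpi_eval_action(psi, s, C, w, epsilon=0.001, rng=_rng):
    q = np.stack([psi(s, z) @ w for z in C]).max(axis=0)  # GPI action values for task w
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(q.argmax())
```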