Universal Successor Features Approximators
Authors: Diana Borsa, Andre Barreto, John Quan, Daniel J. Mankowitz, Hado van Hasselt, Remi Munos, David Silver, Tom Schaul
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we describe the experiments conducted to test the proposed architecture in a multitask setting and assess its ability to generalise to unseen tasks. |
| Researcher Affiliation | Industry | DeepMind, London, UK; {borsa,andrebarreto,johnquan,dmankowitz,munos,hado,davidsilver,schaul}@google.com |
| Pseudocode | Yes | Algorithm 1: Learn USFA with ϵ-greedy Q-learning (a hedged sketch of this training loop follows the table) |
| Open Source Code | No | The paper provides links to videos of USFAs in action but does not explicitly state that the source code for the methodology is openly available or provide a repository link. |
| Open Datasets | Yes | We used the DeepMind Lab platform to design a 3D environment consisting of one large room containing four types of objects: TVs, balls, hats, and balloons (Beattie et al., 2016; Barreto et al., 2018). |
| Dataset Splits | No | The paper defines training and test task sets (M and M') and notes that it evaluates on unseen tasks, but it does not specify explicit training/validation/test dataset splits with percentages or sample counts for the underlying observation data. |
| Hardware Specification | No | The paper mentions using a distributed architecture (IMPALA) for data collection and processing, implying multiple machines, but it does not specify any particular hardware components like GPU or CPU models, or memory details. |
| Software Dependencies | No | The paper mentions using the DeepMind Lab platform and the Q(λ) algorithm, but it does not specify versions for any key software components or libraries used for implementation. |
| Experiment Setup | Yes | We trained the above architecture end-to-end using a variation of Alg. 1 that uses Watkins's (1989) Q(λ) to apply Q-learning with eligibility traces. As for the distribution D_z used in line 5 of Alg. 1, we adopted a Gaussian centred at w: z ∼ N(w, 0.1 I), where I is the identity matrix. ... For all agents we used λ = 0.9. ... For the distributed collection of data we used 50 actors per task. Each actor gathered trajectories of length 32 that were then added to the common queue. The collection of data followed an ϵ-greedy policy with a fixed ϵ = 0.1. ... Evaluations are done with a small ϵ = 0.001, following a GPI policy with different instantiations of C. (The reported values are collected in a configuration sketch after the table.) |
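The Pseudocode row above quotes the title of Algorithm 1. Below is a minimal Python sketch of what such a USFA training loop with ϵ-greedy GPI behaviour and Q-learning-style successor-feature updates could look like, assuming a simplified single-actor environment interface. All names (`psi`, `update_psi`, `phi`, `env`, `train_tasks`) are hypothetical placeholders rather than the authors' implementation, and the sketch omits the Q(λ) eligibility traces and the distributed data collection described in the paper.

```python
import numpy as np

def gpi_action(psi, s, w, zs, n_actions, eps, rng):
    """ϵ-greedy GPI action on task w: argmax_a max_z psi(s, a, z)^T w."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    q = np.array([[psi(s, a, z) @ w for z in zs] for a in range(n_actions)])
    return int(q.max(axis=1).argmax())

def train_usfa(env, psi, update_psi, phi, train_tasks,
               n_z=5, eps=0.1, gamma=0.99, sigma=0.1,
               n_episodes=1000, seed=0):
    """Hypothetical single-actor sketch of a USFA training loop.

    psi(s, a, z)        -> successor-feature vector (same dim as w)
    update_psi(s, a, z, target) -> regress psi(s, a, z) toward target
    phi(s, a, s_next)   -> cumulant / feature vector of the transition
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_episodes):
        w = train_tasks[rng.integers(len(train_tasks))]   # sample a training task
        zs = [rng.normal(w, sigma) for _ in range(n_z)]   # policy vectors near w
        s, done = env.reset(), False
        while not done:
            a = gpi_action(psi, s, w, zs, env.n_actions, eps, rng)
            s_next, done = env.step(a)                    # simplified interface
            for z in zs:
                # Q-learning target for the policy induced by z: bootstrap on
                # the action that is greedy with respect to z itself.
                a_next = int(np.argmax(
                    [psi(s_next, b, z) @ z for b in range(env.n_actions)]))
                target = phi(s, a, s_next) + gamma * psi(s_next, a_next, z)
                update_psi(s, a, z, target)
            s = s_next
```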
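For quick reference, the concrete values quoted in the Experiment Setup row can be gathered into a small configuration block with a helper for sampling policy vectors. This is an illustrative summary only: the dictionary keys and the helper are hypothetical, and the 0.1 in z ∼ N(w, 0.1 I) is treated here as a per-dimension standard deviation for simplicity, although the paper's notation reads as a covariance of 0.1·I.

```python
import numpy as np

# Values quoted in the Experiment Setup row; key names are illustrative.
USFA_SETUP = {
    "lambda": 0.9,             # eligibility-trace parameter for Q(λ)
    "actors_per_task": 50,     # distributed actors collecting data per task
    "trajectory_length": 32,   # length of trajectories added to the common queue
    "behaviour_epsilon": 0.1,  # ϵ-greedy exploration during data collection
    "eval_epsilon": 0.001,     # near-greedy ϵ used at evaluation time (GPI over C)
    "z_scale": 0.1,            # the 0.1 in D_z: z ~ N(w, 0.1 I)
}

def sample_policy_vectors(w, n_z, scale=USFA_SETUP["z_scale"], rng=None):
    """Sample n_z policy vectors around the task vector w (line 5 of Alg. 1,
    as quoted above); `scale` is used as a per-dimension std for illustration."""
    rng = rng or np.random.default_rng()
    w = np.asarray(w, dtype=float)
    return [rng.normal(w, scale) for _ in range(n_z)]
```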