Reinforcement Learning with Unsupervised Auxiliary Tasks
Authors: Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce an agent that also learns separate policies for maximising many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth. In Section 4 we apply our UNREAL agent to a challenging set of 3D-vision based domains known as the Labyrinth (Mnih et al., 2016), learning solely from the raw RGB pixels of a first-person view. (An illustrative loss-combination sketch follows the table.) |
| Researcher Affiliation | Industry | DeepMind, London, UK {jaderberg,vmnih,lejlot,schaul,jzl,davidsilver,korayk}@google.com |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to YouTube videos visualising the agent's performance, but it does not provide access to source code for the described methodology. |
| Open Datasets | Yes | We applied the UNREAL agent as well as UNREAL without pixel control to 57 Atari games from the Arcade Learning Environment (Bellemare et al., 2012) domain. In Section 4 we apply our UNREAL agent to a challenging set of 3D-vision based domains known as the Labyrinth (Mnih et al., 2016), learning solely from the raw RGB pixels of a first-person view. |
| Dataset Splits | No | The paper describes hyperparameter sweeps and selecting the 'top-3' or 'top-5 jobs', which implies a validation process for hyperparameter tuning, but it does not explicitly describe training/validation/test *dataset splits* with percentages or counts. |
| Hardware Specification | No | The paper describes the neural network architecture and training process but does not specify any particular hardware (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions using specific algorithms and components like 'A3C', 'CNN-LSTM agent', 'RMSprop', and 'LSTM with forget gates (Gers et al., 2000)'. However, it does not list specific software libraries or frameworks with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | In all our experiments we used an A3C CNN-LSTM agent as our baseline and the UNREAL agent along with its ablated variants added auxiliary outputs and losses to this base agent. The agent is trained on-policy with 20-step returns and the auxiliary tasks are performed every 20 environment steps, corresponding to every update of the base A3C agent. The replay buffer stores the most recent 2k observations, actions, and rewards taken by the base agent. The agents are optimised over 32 asynchronous threads with shared RMSprop (Mnih et al., 2016). The learning rates are sampled from a log-uniform distribution between 0.0001 and 0.005. The entropy costs are sampled from the log-uniform distribution between 0.0005 and 0.01. (An illustrative sketch of this training configuration follows the table.) |
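
The "Research Type" row above describes an agent whose auxiliary pseudo-reward tasks share a common representation with the base reinforcement-learning agent. As a rough illustration (not the authors' implementation), the sketch below shows a shared CNN-LSTM torso with A3C policy/value heads plus simplified pixel-control and reward-prediction heads, and a weighted combination of their losses; all layer sizes, head shapes, and loss weights (`lambda_pc`, `lambda_rp`, `lambda_vr`) are assumptions made for illustration.

```python
# Illustrative sketch only: a shared CNN-LSTM torso with base A3C heads plus
# simplified auxiliary heads, loosely following the UNREAL idea of adding
# auxiliary losses to a base agent. Layer sizes, head shapes, and loss weights
# are assumptions, not values taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnrealStyleNet(nn.Module):
    def __init__(self, num_actions: int, lstm_size: int = 256):
        super().__init__()
        # Shared convolutional torso (assumes 84x84 RGB observations).
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 9 * 9, lstm_size)
        self.lstm = nn.LSTMCell(lstm_size, lstm_size)
        # Base A3C heads.
        self.policy_logits = nn.Linear(lstm_size, num_actions)
        self.value = nn.Linear(lstm_size, 1)
        # Auxiliary heads sharing the same representation. The paper uses a
        # deconvolutional pixel-control head over a grid of cells; a flat
        # linear layer is used here purely to keep the sketch short.
        self.pixel_control_q = nn.Linear(lstm_size, num_actions * 20 * 20)
        self.reward_prediction = nn.Linear(lstm_size, 3)  # negative / zero / positive reward

    def forward(self, obs, hidden):
        x = self.conv(obs)
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        h, c = self.lstm(x, hidden)
        return (
            self.policy_logits(h),
            self.value(h),
            self.pixel_control_q(h),
            self.reward_prediction(h),
            (h, c),
        )


def combined_loss(a3c_loss, pixel_control_loss, reward_prediction_loss, value_replay_loss,
                  lambda_pc=0.01, lambda_rp=1.0, lambda_vr=1.0):
    """Weighted sum of the base loss and the auxiliary losses (weights are placeholders)."""
    return (a3c_loss
            + lambda_pc * pixel_control_loss
            + lambda_rp * reward_prediction_loss
            + lambda_vr * value_replay_loss)
```

The replay buffer and the off-policy training of the auxiliary tasks, mentioned in the "Experiment Setup" row, are omitted here for brevity.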
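
The "Experiment Setup" row quotes concrete training details: 20-step returns, a replay buffer of the most recent 2k transitions, 32 asynchronous threads with shared RMSprop, and log-uniform sampling ranges for the learning rate and entropy cost. The minimal sketch below shows one way to reproduce that sampling and buffering scheme; the sweep size and the per-replica dictionary structure are assumptions, not details stated in the paper.

```python
# Minimal sketch of the sampling described in the Experiment Setup row: learning
# rate and entropy cost drawn log-uniformly from the stated ranges, plus a replay
# buffer bounded at the most recent 2k transitions. The replica count below is an
# assumption for illustration.
import math
import random
from collections import deque


def log_uniform(low: float, high: float) -> float:
    """Sample from a log-uniform distribution over [low, high]."""
    return math.exp(random.uniform(math.log(low), math.log(high)))


NUM_REPLICAS = 16          # assumed sweep size, not from the paper
UNROLL_LENGTH = 20         # 20-step returns; auxiliary tasks every 20 env steps
NUM_THREADS = 32           # asynchronous workers sharing RMSprop statistics
REPLAY_CAPACITY = 2000     # most recent 2k observations, actions, and rewards

sweep = [
    {
        "learning_rate": log_uniform(1e-4, 5e-3),   # between 0.0001 and 0.005
        "entropy_cost": log_uniform(5e-4, 1e-2),    # between 0.0005 and 0.01
        "replay_buffer": deque(maxlen=REPLAY_CAPACITY),
    }
    for _ in range(NUM_REPLICAS)
]

for config in sweep:
    print(f"lr={config['learning_rate']:.5f}  entropy={config['entropy_cost']:.5f}")
```

Drawing both values in log space matches the quoted "log-uniform distribution" and spreads samples evenly across orders of magnitude rather than clustering near the upper end of the range.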