Reinforcement Learning with Unsupervised Auxiliary Tasks
Authors: Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, Koray Kavukcuoglu
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we introduce an agent that also learns separate policies for maximising many other pseudo-reward functions simultaneously by reinforcement learning. All of these tasks share a common representation that, like unsupervised learning, continues to develop in the absence of extrinsic rewards. We also introduce a novel mechanism for focusing this representation upon extrinsic rewards, so that learning can rapidly adapt to the most relevant aspects of the actual task. Our agent significantly outperforms the previous state-of-the-art on Atari, averaging 880% expert human performance, and a challenging suite of first-person, three-dimensional Labyrinth tasks leading to a mean speedup in learning of 10× and averaging 87% expert human performance on Labyrinth. In Section 4 we apply our UNREAL agent to a challenging set of 3D-vision based domains known as the Labyrinth (Mnih et al., 2016), learning solely from the raw RGB pixels of a first-person view. (An illustrative loss-combination sketch follows the table.) |
| Researcher Affiliation | Industry | DeepMind, London, UK {jaderberg,vmnih,lejlot,schaul,jzl,davidsilver,korayk}@google.com |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides links to YouTube videos visualising the agent's performance, but it does not provide access to source code for the described methodology. |
| Open Datasets | Yes | We applied the UNREAL agent as well as UNREAL without pixel control to 57 Atari games from the Arcade Learning Environment (Bellemare et al., 2012) domain. In Section 4 we apply our UNREAL agent to a challenging set of 3D-vision based domains known as the Labyrinth (Mnih et al., 2016), learning solely from the raw RGB pixels of a first-person view. |
| Dataset Splits | No | The paper describes hyperparameter sweeps and selecting the 'top-3' or 'top-5 jobs', which implies a validation process for hyperparameter tuning, but it does not explicitly describe training/validation/test *dataset splits* with percentages or counts. |
| Hardware Specification | No | The paper describes the neural network architecture and training process but does not specify any particular hardware (e.g., GPU models, CPU types) used for the experiments. |
| Software Dependencies | No | The paper mentions using specific algorithms and components like 'A3C', 'CNN-LSTM agent', 'RMSprop', and 'LSTM with forget gates (Gers et al., 2000)'. However, it does not list specific software libraries or frameworks with version numbers (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | In all our experiments we used an A3C CNN-LSTM agent as our baseline and the UNREAL agent along with its ablated variants added auxiliary outputs and losses to this base agent. The agent is trained on-policy with 20-step returns and the auxiliary tasks are performed every 20 environment steps, corresponding to every update of the base A3C agent. The replay buffer stores the most recent 2k observations, actions, and rewards taken by the base agent. The agents are optimised over 32 asynchronous threads with shared RMSprop (Mnih et al., 2016). The learning rates are sampled from a log-uniform distribution between 0.0001 and 0.005. The entropy costs are sampled from the log-uniform distribution between 0.0005 and 0.01. (An illustrative sketch of this training configuration follows the table.) |
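
The "Research Type" row above describes an agent whose auxiliary pseudo-reward tasks share a common representation with the base reinforcement-learning agent. As a rough illustration (not the authors' implementation), the sketch below shows a shared CNN-LSTM torso with A3C policy/value heads plus simplified pixel-control and reward-prediction heads, and a weighted combination of their losses; all layer sizes, head shapes, and loss weights (`lambda_pc`, `lambda_rp`, `lambda_vr`) are assumptions made for illustration.

```python
# Illustrative sketch only: a shared CNN-LSTM torso with base A3C heads plus
# simplified auxiliary heads, loosely following the UNREAL idea of adding
# auxiliary losses to a base agent. Layer sizes, head shapes, and loss weights
# are assumptions, not values taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnrealStyleNet(nn.Module):
    def __init__(self, num_actions: int, lstm_size: int = 256):
        super().__init__()
        # Shared convolutional torso (assumes 84x84 RGB observations).
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Linear(32 * 9 * 9, lstm_size)
        self.lstm = nn.LSTMCell(lstm_size, lstm_size)
        # Base A3C heads.
        self.policy_logits = nn.Linear(lstm_size, num_actions)
        self.value = nn.Linear(lstm_size, 1)
        # Auxiliary heads sharing the same representation. The paper uses a
        # deconvolutional pixel-control head over a grid of cells; a flat
        # linear layer is used here purely to keep the sketch short.
        self.pixel_control_q = nn.Linear(lstm_size, num_actions * 20 * 20)
        self.reward_prediction = nn.Linear(lstm_size, 3)  # negative / zero / positive reward

    def forward(self, obs, hidden):
        x = self.conv(obs)
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        h, c = self.lstm(x, hidden)
        return (
            self.policy_logits(h),
            self.value(h),
            self.pixel_control_q(h),
            self.reward_prediction(h),
            (h, c),
        )


def combined_loss(a3c_loss, pixel_control_loss, reward_prediction_loss, value_replay_loss,
                  lambda_pc=0.01, lambda_rp=1.0, lambda_vr=1.0):
    """Weighted sum of the base loss and the auxiliary losses (weights are placeholders)."""
    return (a3c_loss
            + lambda_pc * pixel_control_loss
            + lambda_rp * reward_prediction_loss
            + lambda_vr * value_replay_loss)
```

The replay buffer and the off-policy training of the auxiliary tasks, mentioned in the "Experiment Setup" row, are omitted here for brevity.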
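
The "Experiment Setup" row quotes concrete training details: 20-step returns, a replay buffer of the most recent 2k transitions, 32 asynchronous threads with shared RMSprop, and log-uniform sampling ranges for the learning rate and entropy cost. The minimal sketch below shows one way to reproduce that sampling and buffering scheme; the sweep size and the per-replica dictionary structure are assumptions, not details stated in the paper.

```python
# Minimal sketch of the sampling described in the Experiment Setup row: learning
# rate and entropy cost drawn log-uniformly from the stated ranges, plus a replay
# buffer bounded at the most recent 2k transitions. The replica count below is an
# assumption for illustration.
import math
import random
from collections import deque


def log_uniform(low: float, high: float) -> float:
    """Sample from a log-uniform distribution over [low, high]."""
    return math.exp(random.uniform(math.log(low), math.log(high)))


NUM_REPLICAS = 16          # assumed sweep size, not from the paper
UNROLL_LENGTH = 20         # 20-step returns; auxiliary tasks every 20 env steps
NUM_THREADS = 32           # asynchronous workers sharing RMSprop statistics
REPLAY_CAPACITY = 2000     # most recent 2k observations, actions, and rewards

sweep = [
    {
        "learning_rate": log_uniform(1e-4, 5e-3),   # between 0.0001 and 0.005
        "entropy_cost": log_uniform(5e-4, 1e-2),    # between 0.0005 and 0.01
        "replay_buffer": deque(maxlen=REPLAY_CAPACITY),
    }
    for _ in range(NUM_REPLICAS)
]

for config in sweep:
    print(f"lr={config['learning_rate']:.5f}  entropy={config['entropy_cost']:.5f}")
```

Drawing both values in log space matches the quoted "log-uniform distribution" and spreads samples evenly across orders of magnitude rather than clustering near the upper end of the range.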