Playing hard exploration games by watching YouTube
Authors: Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 8 presents our learning curves for each hard exploration Atari game. Without imitation reward, the pure RL agent is unable to collect any of the sparse rewards in MONTEZUMA'S REVENGE and PITFALL!, and only reaches the first two rewards in PRIVATE EYE (consistent with previous studies using DQN variants [19, 22]). Using pixel-space features, the guided agent is able to obtain 17k points in PRIVATE EYE but still fails to make progress in the other games. Replacing a pixel embedding with our combined TDC+CMC embedding convincingly yields the best results, even if the agent is presented only with our TDC+CMC imitation reward (i.e. no environment reward). Finally, in Table 1 we compare our best policies for each game to the best previously published results; Rainbow [19] and Ape-X DQN [22] without demonstrations, and DQfD [20] using expert demonstrations. Unlike DQfD, our demonstrations are unaligned YouTube footage without access to action or reward trajectories. Our results are calculated using the standard approach of averaging over 200 episodes, each initialized with a random number (1-to-30) of no-op actions. Importantly, our approach is the first to convincingly exceed human-level performance on all three games even in the absence of an environment reward signal. (A hedged sketch of this no-op evaluation protocol appears after the table.) |
| Researcher Affiliation | Industry | Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas; DeepMind, London, UK; {yusufaytar,tpfaff,budden,tpaine,ziyu,nandodefreitas}@google.com |
| Pseudocode | No | The paper describes methods in text and uses equations but does not present a formal pseudocode block or algorithm box. |
| Open Source Code | No | Videos of our agent playing these games can be found here: https://www.youtube.com/playlist?list=PLZuOGGtntKlaOoq_8wk5aKgE_u_Qcpqhu |
| Open Datasets | No | We consider three Atari 2600 games that are considered very difficult exploration challenges: MONTEZUMA'S REVENGE, PITFALL! and PRIVATE EYE. For each, we select four YouTube videos (three training and one test) of human gameplay, varying in duration from 3-to-10 minutes. Importantly, none of the YouTube videos were collected using our specific Arcade Learning Environment [10]... |
| Dataset Splits | No | For each, we select four YouTube videos (three training and one test) of human gameplay, varying in duration from 3-to-10 minutes. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory) are mentioned in the paper. |
| Software Dependencies | No | The model is trained with Adam using a learning rate of 10^-4 and batch size of 32 for 200,000 steps. As described in Section 4, our imitation loss is constructed by generating checkpoints every N = 16 frames along the φ-embedded observation sequence of a single YouTube video. We train an agent using the sum of imitation and (optionally) environment rewards. We use the distributed A3C RL agent IMPALA [14] with 100 actors for our experiments. |
| Experiment Setup | Yes | The model is trained with Adam using a learning rate of 10^-4 and batch size of 32 for 200,000 steps. We use the distributed A3C RL agent IMPALA [14] with 100 actors for our experiments. We also set Δt = 1 and α = 0.5 for our experiments (except when considering pixel-only embeddings, where α = 0.92 provided the best performance). (A hedged sketch of the checkpoint-based imitation reward appears after the table.) |
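
The Experiment Setup and Software Dependencies rows describe how the demonstration is turned into a reward: checkpoints are placed every N = 16 frames along the φ-embedded observation sequence of a single YouTube video, and the agent is trained on the sum of imitation and (optionally) environment rewards with IMPALA. Below is a minimal Python sketch of that idea, assuming a cosine-similarity match against the next checkpoint with threshold α = 0.5 and a fixed per-checkpoint bonus of 0.5; the exact matching rule and reward magnitude are not quoted in the table, so treat those specifics as assumptions rather than the authors' implementation.

```python
import numpy as np


def make_checkpoints(demo_embeddings, n=16):
    """Select every n-th phi-embedded demonstration frame as a checkpoint
    (n = 16, as quoted in the Experiment Setup row)."""
    return demo_embeddings[::n]


class CheckpointImitationReward:
    """Hedged sketch of a checkpoint-based imitation reward: the agent earns a
    small bonus each time its embedded observation matches the next unvisited
    checkpoint. Strict in-order matching here corresponds to the quoted
    Δt = 1; the cosine-similarity test and bonus value are assumptions."""

    def __init__(self, checkpoints, alpha=0.5, bonus=0.5):
        self.checkpoints = checkpoints
        self.alpha = alpha    # assumed similarity threshold (0.92 quoted for pixel-only embeddings)
        self.bonus = bonus    # assumed per-checkpoint reward
        self.next_idx = 0     # index of the next checkpoint to be matched

    def __call__(self, agent_embedding):
        if self.next_idx >= len(self.checkpoints):
            return 0.0
        target = self.checkpoints[self.next_idx]
        # Cosine similarity between the agent's embedded frame and the next checkpoint.
        sim = float(np.dot(agent_embedding, target) /
                    (np.linalg.norm(agent_embedding) * np.linalg.norm(target) + 1e-8))
        if sim > self.alpha:
            self.next_idx += 1   # checkpoints are consumed strictly in order
            return self.bonus
        return 0.0
```

In a training loop, the total reward fed to the RL agent would be the environment reward plus `imitation(phi(observation))`, matching the quoted "sum of imitation and (optionally) environment rewards"; the embedding function φ (TDC+CMC) and the IMPALA setup itself are outside the scope of this sketch.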
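
The Research Type row quotes the standard no-op-start evaluation: scores averaged over 200 episodes, each initialized with 1-to-30 random no-op actions. The following is a hedged Python sketch of that protocol, assuming a Gym-style `env` with `reset`/`step`, a callable `policy`, and action 0 as the Atari no-op; none of these interfaces are taken from the paper.

```python
import random


def evaluate_with_no_op_starts(env, policy, episodes=200, max_no_ops=30, noop_action=0):
    """Average episode score over `episodes` runs, each prefixed with a
    uniformly random 1-to-`max_no_ops` no-op actions (assumed Gym-style API)."""
    scores = []
    for _ in range(episodes):
        obs = env.reset()
        total, done = 0.0, False
        # Random no-op prefix before handing control to the policy.
        for _ in range(random.randint(1, max_no_ops)):
            obs, reward, done, _ = env.step(noop_action)
            total += reward
            if done:
                break
        # Roll out the learned policy until the episode terminates.
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        scores.append(total)
    return sum(scores) / len(scores)
```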