Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Playing hard exploration games by watching YouTube
Authors: Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, Nando de Freitas
NeurIPS 2018 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 8 presents our learning curves for each hard exploration Atari game. Without imitation reward, the pure RL agent is unable to collect any of the sparse rewards in MONTEZUMA'S REVENGE and PITFALL!, and only reaches the first two rewards in PRIVATE EYE (consistent with previous studies using DQN variants [19, 22]). Using pixel-space features, the guided agent is able to obtain 17k points in PRIVATE EYE but still fails to make progress in the other games. Replacing a pixel embedding with our combined TDC+CMC embedding convincingly yields the best results, even if the agent is presented only with our TDC+CMC imitation reward (i.e. no environment reward). Finally, in Table 1 we compare our best policies for each game to the best previously published results; Rainbow [19] and Ape-X DQN [22] without demonstrations, and DQfD [20] using expert demonstrations. Unlike DQfD our demonstrations are unaligned YouTube footage without access to action or reward trajectories. Our results are calculated using the standard approach of averaging over 200 episodes initialized using a random 1-to-30 no-op actions. Importantly, our approach is the first to convincingly exceed human-level performance on all three games even in the absence of an environment reward signal. |
| Researcher Affiliation | Industry | Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas. DeepMind, London, UK. EMAIL |
| Pseudocode | No | The paper describes methods in text and uses equations but does not present a formal pseudocode block or algorithm box. |
| Open Source Code | No | Videos of our agent playing these games can be found here: https://www.youtube.com/playlist?list=PLZuOGGtntKlaOoq_8wk5aKgE_u_Qcpqhu |
| Open Datasets | No | We consider three Atari 2600 games that are considered very difficult exploration challenges: MONTEZUMA'S REVENGE, PITFALL! and PRIVATE EYE. For each, we select four YouTube videos (three training and one test) of human gameplay, varying in duration from 3-to-10 minutes. Importantly, none of the YouTube videos were collected using our specific Arcade Learning Environment [10]... |
| Dataset Splits | No | For each, we select four YouTube videos (three training and one test) of human gameplay, varying in duration from 3-to-10 minutes. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory) are mentioned in the paper. |
| Software Dependencies | No | The model is trained with Adam using a learning rate of 10⁻⁴ and batch size of 32 for 200,000 steps. As described in Section 4, our imitation loss is constructed by generating checkpoints every N = 16 frames along the φ-embedded observation sequence of a single YouTube video. We train an agent using the sum of imitation and (optionally) environment rewards. We use the distributed A3C RL agent IMPALA [14] with 100 actors for our experiments. |
| Experiment Setup | Yes | The model is trained with Adam using a learning rate of 10⁻⁴ and batch size of 32 for 200,000 steps. We use the distributed A3C RL agent IMPALA [14] with 100 actors for our experiments. We also set t = 1 and α = 0.5 for our experiments (except when considering pixel-only embeddings, where α = 0.92 provided the best performance). |
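
The checkpoint scheme quoted in the rows above (checkpoints every N = 16 frames along the φ-embedded video, with a threshold α = 0.5) can be illustrated with a minimal sketch. This is not the authors' code, which the table notes was not released. The names `make_checkpoints` and `CheckpointReward`, the use of cosine similarity as the match criterion, the bonus value of 0.5, the choice to start checkpoints at frame 0, and the rule of matching only the next unreached checkpoint are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def make_checkpoints(embedded_frames: Sequence[Vector], every_n: int = 16) -> List[Vector]:
    """Keep every `every_n`-th embedded frame (starting at frame 0, an assumption)."""
    return list(embedded_frames[::every_n])


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class CheckpointReward:
    """Hypothetical imitation reward: pay a bonus when the agent's embedding
    is similar enough to the next unreached checkpoint."""

    def __init__(self, checkpoints: Sequence[Vector], alpha: float = 0.5, bonus: float = 0.5):
        self.checkpoints = list(checkpoints)
        self.alpha = alpha      # similarity threshold (alpha = 0.5 in the quoted setup)
        self.bonus = bonus      # reward per checkpoint (assumed value)
        self.next_index = 0     # index of the next checkpoint to reach

    def __call__(self, agent_embedding: Vector) -> float:
        if self.next_index >= len(self.checkpoints):
            return 0.0  # every checkpoint has already been collected
        target = self.checkpoints[self.next_index]
        if cosine_similarity(agent_embedding, target) > self.alpha:
            self.next_index += 1
            return self.bonus
        return 0.0


# Toy 2-D "embeddings" standing in for the embedded frames of one video.
frames = [[1.0, 0.0]] * 16 + [[0.0, 1.0]] * 16 + [[-1.0, 0.0]] * 8
reward_fn = CheckpointReward(make_checkpoints(frames))
print(reward_fn([1.0, 0.1]))  # close to the first checkpoint
```

In a full agent, this value would be added to the (optional) environment reward at each step, in line with the quoted statement that the agent is trained on the sum of the two.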