reproducibilityindex.ai

Playing hard exploration games by watching YouTube

Authors: Yusuf Aytar, Tobias Pfaff, David Budden, Thomas Paine, Ziyu Wang, Nando de Freitas

NeurIPS 2018

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Figure 8 presents our learning curves for each hard exploration Atari game. Without imitation reward, the pure RL agent is unable to collect any of the sparse rewards in MONTEZUMA'S REVENGE and PITFALL!, and only reaches the first two rewards in PRIVATE EYE (consistent with previous studies using DQN variants [19, 22]). Using pixel-space features, the guided agent is able to obtain 17k points in PRIVATE EYE but still fails to make progress in the other games. Replacing a pixel embedding with our combined TDC+CMC embedding convincingly yields the best results, even if the agent is presented only with our TDC+CMC imitation reward (i.e. no environment reward). Finally, in Table 1 we compare our best policies for each game to the best previously published results; Rainbow [19] and Ape-X DQN [22] without demonstrations, and DQfD [20] using expert demonstrations. Unlike DQfD our demonstrations are unaligned YouTube footage without access to action or reward trajectories. Our results are calculated using the standard approach of averaging over 200 episodes initialized using a random 1-to-30 no-op actions. Importantly, our approach is the first to convincingly exceed human-level performance on all three games even in the absence of an environment reward signal.
Researcher Affiliation	Industry	Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas DeepMind, London, UK {yusufaytar,tpfaff,budden,tpaine,ziyu,nandodefreitas}@google.com
Pseudocode	No	The paper describes methods in text and uses equations but does not present a formal pseudocode block or algorithm box.
Open Source Code	No	Videos of our agent playing these games can be found here: https://www.youtube.com/playlist?list=PLZuOGGtntKlaOoq_8wk5aKgE_u_Qcpqhu
Open Datasets	No	We consider three Atari 2600 games that are considered very difficult exploration challenges: MONTEZUMA'S REVENGE, PITFALL! and PRIVATE EYE. For each, we select four YouTube videos (three training and one test) of human gameplay, varying in duration from 3-to-10 minutes. Importantly, none of the YouTube videos were collected using our specific Arcade Learning Environment [10]...
Dataset Splits	No	For each, we select four YouTube videos (three training and one test) of human gameplay, varying in duration from 3-to-10 minutes.
Hardware Specification	No	No specific hardware details (like GPU/CPU models, memory) are mentioned in the paper.
Software Dependencies	No	The model is trained with Adam using a learning rate of 10^-4 and batch size of 32 for 200,000 steps. As described in Section 4, our imitation loss is constructed by generating checkpoints every N = 16 frames along the φ-embedded observation sequence of a single YouTube video. We train an agent using the sum of imitation and (optionally) environment rewards. We use the distributed A3C RL agent IMPALA [14] with 100 actors for our experiments.
Experiment Setup	Yes	The model is trained with Adam using a learning rate of 10^-4 and batch size of 32 for 200,000 steps. We use the distributed A3C RL agent IMPALA [14] with 100 actors for our experiments. We also set t = 1 and α = 0.5 for our experiments (except when considering pixel-only embeddings, where α = 0.92 provided the best performance).
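
The setup values quoted in the table can be collected into a small configuration sketch, which may help anyone attempting a reproduction. This is not the authors' code: the class name, field names, and helper functions below are hypothetical, and starting the checkpoints at frame N (rather than frame 0) is an assumption. Only the numeric values come from the quoted text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters quoted in the table; field names are illustrative."""
    learning_rate: float = 1e-4     # Adam learning rate for the embedding model
    batch_size: int = 32            # embedding training batch size
    train_steps: int = 200_000      # embedding training steps
    checkpoint_interval: int = 16   # N: frames between checkpoints
    num_actors: int = 100           # IMPALA actors
    alpha: float = 0.5              # 0.92 was used for pixel-only embeddings
    eval_episodes: int = 200        # episodes averaged per reported score
    max_noops: int = 30             # upper bound on random initial no-ops


def checkpoint_indices(num_frames: int, interval: int) -> list[int]:
    """Frame indices of checkpoints taken every `interval` frames.

    Assumption: the first checkpoint falls at frame `interval`, not frame 0.
    """
    return list(range(interval, num_frames, interval))


def sample_noops(rng: random.Random, max_noops: int) -> int:
    """Draw the number of no-op actions used to initialize one episode."""
    return rng.randint(1, max_noops)  # inclusive on both ends


def mean_score(episode_scores: list[float]) -> float:
    """Average score over evaluation episodes."""
    return sum(episode_scores) / len(episode_scores)
```

For example, `checkpoint_indices(64, ExperimentSetup().checkpoint_interval)` gives the checkpoint frames for a 64-frame clip, and `sample_noops(random.Random(0), 30)` draws one episode's initial no-op count.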