Deep Q-learning From Demonstrations

Authors: Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Leibo, Audrunas Gruslys

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN) as it starts with better scores on the first million steps on 41 of 42 games and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN. (A hedged sketch of this combined loss appears after the table.)
Researcher Affiliation | Industry | Todd Hester, Google DeepMind, toddhester@google.com; Matej Vecerik, Google DeepMind, matejvecerik@google.com; Olivier Pietquin, Google DeepMind, pietquin@google.com; Marc Lanctot, Google DeepMind, lanctot@google.com; Tom Schaul, Google DeepMind, schaul@google.com; Bilal Piot, Google DeepMind, piot@google.com; Dan Horgan, Google DeepMind, horgan@google.com; John Quan, Google DeepMind, johnquan@google.com; Andrew Sendonaris, Google DeepMind, sendos@yahoo.com; Ian Osband, Google DeepMind, iosband@google.com; Gabriel Dulac-Arnold, Google DeepMind, gabe@squirrelsoup.net; John Agapiou, Google DeepMind, jagapiou@google.com; Joel Z. Leibo, Google DeepMind, jzl@google.com; Audrunas Gruslys, Google DeepMind, audrunas@google.com
Pseudocode | Yes | Pseudo-code is sketched in Algorithm 1. (A hedged reconstruction of the two-phase training loop appears after the table.)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It provides a link to a video, but not to code.
Open Datasets | Yes | We evaluated DQfD on the Arcade Learning Environment (ALE) (Bellemare et al. 2013). ALE is a set of Atari games that are a standard benchmark for DQN and contains many games on which humans still perform better than the best learning agents.
Dataset Splits | No | The paper states 'We performed informal parameter tuning for all the algorithms on six Atari games and then used the same parameters for the entire set of games.' This implies a form of validation, but it does not specify a separate data split for validation in the conventional sense (e.g., specific percentages or sample counts for a validation set).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions software components like 'dueling state-advantage convolutional network architecture (Wang et al. 2016)' but does not specify any version numbers for libraries, frameworks, or languages used (e.g., TensorFlow 1.x, Python 3.x).
Experiment Setup | Yes | The agent applies a discount factor of 0.99 and all of its actions are repeated for four Atari frames. Each episode is initialized with up to 30 no-op actions to provide random starting positions. The parameters used for the algorithms are shown in the appendix.
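
The Research Type row describes DQfD as combining temporal-difference updates with supervised classification of the demonstrator's actions, with a prioritized replay mechanism balancing demonstration and self-generated data. The following is a minimal sketch of such a combined loss, written in PyTorch under assumptions not stated in that row: the supervised term is taken to be a large-margin classification loss, the TD term uses a Huber double-Q update, and the constants MARGIN and LAMBDA_E are illustrative placeholders rather than the paper's values. The paper's full loss also includes an n-step TD term and L2 regularization, which this sketch omits.

```python
# Hedged sketch of a DQfD-style combined loss: a double-Q TD term plus a
# large-margin supervised term on demonstrator actions. MARGIN and LAMBDA_E
# are illustrative assumptions, not values taken from the paper.
import torch
import torch.nn.functional as F

GAMMA = 0.99      # discount factor reported in the paper
MARGIN = 0.8      # expert margin; assumed value
LAMBDA_E = 1.0    # weight on the supervised term; assumed value


def td_loss(q_net, target_net, s, a, r, s_next, done):
    """Per-sample 1-step double-Q TD error (Huber)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # online net selects
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # target net evaluates
        target = r + GAMMA * (1.0 - done) * q_next
    return F.smooth_l1_loss(q_sa, target, reduction="none")


def large_margin_loss(q_net, s, a_expert, is_demo):
    """Supervised term: push Q(s, a_expert) above every other action by MARGIN.
    Applied only where is_demo == 1 (demonstration transitions)."""
    q = q_net(s)                                        # [batch, n_actions]
    margins = torch.full_like(q, MARGIN)
    margins.scatter_(1, a_expert.unsqueeze(1), 0.0)     # no margin at the expert action
    loss = (q + margins).max(dim=1).values - q.gather(1, a_expert.unsqueeze(1)).squeeze(1)
    return loss * is_demo                               # zero for self-generated data


def dqfd_loss(q_net, target_net, batch):
    """Combined batch loss over mixed demonstration and self-generated data."""
    s, a, r, s_next, done, is_demo = batch
    j_dq = td_loss(q_net, target_net, s, a, r, s_next, done)
    j_e = large_margin_loss(q_net, s, a, is_demo)
    return (j_dq + LAMBDA_E * j_e).mean()
```

The per-sample values returned by td_loss are also the kind of quantity a prioritized replay buffer could use to set sampling priorities, which is how the abstract describes the ratio of demonstration data being assessed automatically during learning.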
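
The Pseudocode row points to Algorithm 1 in the paper, which is not reproduced on this page. The loop below is therefore only a hedged reconstruction of the two-phase structure implied by the abstract: a pre-training phase that samples exclusively from demonstration data, followed by interaction in which self-generated transitions are added to a prioritized replay buffer that permanently retains the demonstrations. It reuses dqfd_loss from the sketch above; the replay.sample()/replay.add() interface, the act() policy callable, and the Gym-style env.step() 4-tuple are assumed placeholders.

```python
# Hedged reconstruction of a two-phase DQfD-style training procedure.
# Builds on dqfd_loss() from the previous sketch; all objects are passed in.

def pretrain(q_net, target_net, optimizer, demo_replay, steps, target_period=10_000):
    """Phase 1: gradient updates from demonstration data only, before acting."""
    for step in range(steps):
        batch = demo_replay.sample()                  # prioritized sampling assumed
        loss = dqfd_loss(q_net, target_net, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % target_period == 0:
            target_net.load_state_dict(q_net.state_dict())


def interact(q_net, target_net, optimizer, env, replay, act, steps, target_period=10_000):
    """Phase 2: act in the environment; new transitions join the buffer, while
    the demonstration data it already holds is never overwritten."""
    obs = env.reset()
    for step in range(steps):
        action = act(q_net, obs)                      # e.g. an epsilon-greedy policy
        next_obs, reward, done, _ = env.step(action)  # older Gym 0.x step API assumed
        replay.add(obs, action, reward, next_obs, done, is_demo=False)
        batch = replay.sample()
        loss = dqfd_loss(q_net, target_net, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % target_period == 0:
            target_net.load_state_dict(q_net.state_dict())
        obs = env.reset() if done else next_obs
```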
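
The Experiment Setup row quotes a discount factor of 0.99, a four-frame action repeat, and up to 30 initial no-op actions. The sketch below shows one way that protocol could be reproduced on top of the ALE through the older OpenAI Gym 0.x Atari interface; the Gym dependency, the 4-tuple step() API, and the choice of game are assumptions, not details given in the paper.

```python
# Hedged sketch of the evaluation protocol quoted above: four-frame action
# repeat and up to 30 random no-ops at the start of each episode. Assumes the
# older Gym 0.x reset/step API and the gym[atari] package.
import random
import gym

GAMMA = 0.99        # discount factor reported in the paper (used in the learning update)
ACTION_REPEAT = 4   # each agent action is repeated for four Atari frames
MAX_NOOPS = 30      # up to 30 no-ops to randomize the starting position
NOOP_ACTION = 0     # action 0 is NOOP in the ALE action set


def reset_with_noops(env):
    """Start an episode with a random number (0..30) of no-op actions."""
    obs = env.reset()
    for _ in range(random.randint(0, MAX_NOOPS)):
        obs, _, done, _ = env.step(NOOP_ACTION)
        if done:
            obs = env.reset()
    return obs


def step_with_repeat(env, action):
    """Repeat the chosen action for four frames, summing the reward."""
    total_reward, done, info, obs = 0.0, False, {}, None
    for _ in range(ACTION_REPEAT):
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, info


env = gym.make("BreakoutNoFrameskip-v4")  # illustrative game choice
obs = reset_with_noops(env)
obs, reward, done, info = step_with_repeat(env, env.action_space.sample())
```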