Policy Distillation
Authors: Andrei Rusu, Sergio Gomez, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, Raia Hadsell
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent. |
| Researcher Affiliation | Collaboration | Google DeepMind, London, UK ({andreirusu, sergomez, gdesjardins, kirkpatrick, razp, vmnih, korayk, raia}@google.com). Caglar Gulcehre (gulcehrc@iro.umontreal.ca) contributed while interning at Google DeepMind; other affiliation: Université de Montréal, Montréal, Canada. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code, nor does it mention that source code is available. |
| Open Datasets | Yes | We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent. Ten popular Atari games were selected and fixed before starting this research. The network used to train the DQN agents is described in (Mnih et al., 2015). |
| Dataset Splits | No | The paper describes data collection and training but does not specify a distinct validation dataset split with percentages or sample counts. |
| Hardware Specification | No | The paper mentions 'Using modern GPUs' but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions RMSProp as a training method but does not list ancillary software with version numbers (e.g., programming language versions, library versions, or solver versions). |
| Experiment Setup | Yes | Online data collection during policy distillation was performed under similar conditions to agent evaluation in Mnih et al. (2015). The DQN agent plays a random number of null-ops (up to 30) to initialize the episode, then acts greedily with respect to its Q-function, except for 5% of actions, which are chosen uniformly at random. Results were robust for primary learning rates between 1.0e-4 and 1.0e-3, with maximum learning rates between 1.0e-3 and 1.0e-1. We chose hyper-parameters using preliminary experiments on 4 games. The reported results consumed 500 hours of teacher gameplay to train each student, less than 50% of the amount that was used to train each DQN teacher. Using modern GPUs we can refresh the replay memory and train the students much faster than real-time, with typical convergence in a few days. With multi-task students we used separate replay memories for each game, with the same capacity of 10 hours, and the respective DQN teachers took turns adding data. After one hour of gameplay the student is trained with 10,000 minibatch updates (each minibatch is drawn from a randomly chosen single game memory). The same 500 hour budget of gameplay was used for all but the largest network, which used 34,000 hours of experience over 10 games. Distillation Targets: Using DQN outputs we have defined three types of training targets that correspond to the three distillation loss functions discussed in Section 3. First, the teacher's Q-values for all actions were used directly as supervised targets; thus, training the student consisted of minimizing the mean squared error (MSE) between the student's and teacher's outputs for each input state. Second, we used only the teacher's highest valued action as a one-hot target and minimized the negative log likelihood (NLL) loss. Finally, we passed Q-values through a softmax function whose temperature (τ = 0.01) was selected empirically from [1.0, 0.1, 0.01, 0.001], and minimized the KL divergence to the student's softmax output. Network Architectures: Details of the architectures used by DQN and single-task distilled agents are given in table A1. Rectifier non-linearities were added between consecutive layers. We used one unit for each valid action in the output layer, which was linear. A final softmax operation was performed when distilling with NLL and KL loss functions. |
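
The data-collection protocol quoted in the Experiment Setup row (up to 30 initial null-ops, then greedy action selection over the teacher's Q-function with 5% uniformly random actions) can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the authors' code: the `env` and `teacher_q` interfaces are hypothetical stand-ins for an Atari emulator and a trained DQN teacher.

```python
import numpy as np

def collect_episode(env, teacher_q, rng, max_noops=30, epsilon=0.05):
    """Roll out one episode with the teacher policy as described in the
    Experiment Setup row: a random number of null-ops (up to 30) to
    initialize the episode, then greedy actions w.r.t. the teacher's
    Q-function except for 5% uniformly random actions.

    `env` (reset/step/num_actions) and `teacher_q` (state -> Q-values)
    are assumed interfaces, not from the paper. `env.step` is assumed
    to return (next_state, reward, done), with action 0 as the no-op.
    """
    trajectory = []  # (state, teacher Q-values) pairs fed to the replay memory
    state = env.reset()
    for _ in range(rng.integers(0, max_noops + 1)):
        state, _, done = env.step(0)          # no-op to randomize the start
        if done:
            state = env.reset()
    done = False
    while not done:
        q_values = teacher_q(state)
        if rng.random() < epsilon:            # 5% exploratory actions
            action = int(rng.integers(env.num_actions))
        else:                                 # otherwise act greedily
            action = int(np.argmax(q_values))
        trajectory.append((state, q_values))  # store teacher outputs as targets
        state, _, done = env.step(action)
    return trajectory
```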
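
The three distillation targets described in the same row (MSE on Q-values, NLL on the teacher's argmax action, and KL against a temperature-softened teacher policy with τ = 0.01) correspond roughly to the loss sketches below. This is an illustrative reconstruction from the quoted description, not the authors' implementation; in particular, whether the student's softmax is also temperature-scaled in the KL term is an assumption here.

```python
import numpy as np

def softmax(x, tau=1.0):
    """Numerically stable softmax with optional temperature tau."""
    z = x / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mse_loss(teacher_q, student_q):
    """First target: regress student Q-values onto teacher Q-values."""
    return np.mean((teacher_q - student_q) ** 2)

def nll_loss(teacher_q, student_q):
    """Second target: teacher's highest-valued action as a one-hot label,
    scored against the student's softmax policy."""
    probs = softmax(student_q)
    best = np.argmax(teacher_q, axis=-1)
    return -np.mean(np.log(probs[np.arange(len(best)), best] + 1e-12))

def kl_loss(teacher_q, student_q, tau=0.01):
    """Third target: KL divergence from the temperature-softened teacher
    policy (tau = 0.01, as selected in the paper) to the student's
    softmax policy."""
    p_teacher = softmax(teacher_q, tau=tau)
    p_student = softmax(student_q)
    return np.mean(np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                                       - np.log(p_student + 1e-12)), axis=-1))

# Toy usage: a batch of 2 states with 4 valid actions each.
teacher_q = np.array([[1.0, 0.2, -0.5, 0.1], [0.3, 0.9, 0.0, -0.2]])
student_q = np.array([[0.8, 0.1, -0.4, 0.0], [0.2, 1.1, 0.1, -0.3]])
print(mse_loss(teacher_q, student_q),
      nll_loss(teacher_q, student_q),
      kl_loss(teacher_q, student_q))
```

A low temperature such as τ = 0.01 sharpens the teacher distribution toward its argmax, which is consistent with the paper's observation that sharpened targets worked best for distilling deterministic-leaning DQN policies.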