Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning

Authors: Emilio Parisotto, Jimmy Ba, Ruslan Salakhutdinov

ICLR 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods. In the following experiments, we validate the Actor-Mimic method by demonstrating its effectiveness at both multitask and transfer learning in the Arcade Learning Environment (ALE).
Researcher Affiliation | Academia | Emilio Parisotto, Jimmy Ba, Ruslan Salakhutdinov; Department of Computer Science, University of Toronto, Toronto, Ontario, Canada; {eparisotto,jimmy,rsalakhu}@cs.toronto.edu
Pseudocode | No | The paper describes the Actor-Mimic method through textual explanations and mathematical formulations, but it does not present a dedicated pseudocode block or algorithm listing (an illustrative sketch of the policy regression objective follows the table).
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods. The Arcade Learning Environment (ALE) (Bellemare et al., 2013)
Dataset Splits | No | The paper describes training and testing epochs for evaluation and uses a replay memory, but it does not specify explicit train/validation/test dataset splits (e.g., in percentages or sample counts) for reproducibility.
Hardware Specification | Yes | Processing 5 million frames with the large model is equivalent to around 4 days of compute time on a NVIDIA GTX Titan.
Software Dependencies | No | All of our Actor-Mimic Networks (AMNs) were trained using the Adam (Kingma & Ba, 2015) optimization algorithm. For the experiments using the DQN algorithm, we optimize the networks with RMSProp. The paper names the Adam and RMSProp optimization algorithms but does not list the software frameworks or library versions used to implement them.
Experiment Setup | Yes | For the transfer experiments with the feature regression objective, we set the scaling parameter β to 0.01 and the feature prediction network f_i was set to a linear projection from the AMN features to the i-th expert features. For the policy regression objective, we use a softmax temperature of 1 in all cases. Additionally, during training for all AMNs we use an ϵ-greedy policy with ϵ set to a constant 0.1. For AMNs we use a per-game 100,000 frame replay memory. We use the full 1,000,000 frame replay memory when training any DQN.
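
As the Pseudocode row notes, the paper specifies Actor-Mimic through equations rather than an algorithm listing. The following is a minimal NumPy sketch of the policy regression objective described in the paper: the cross-entropy between each expert DQN's Boltzmann policy (a softmax over its Q-values at temperature tau) and the AMN's policy. Function and variable names (expert_q, amn_logits, tau) are illustrative choices, not the authors' code.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax.
        z = x - x.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def policy_regression_loss(expert_q, amn_logits, tau=1.0):
        # expert_q:   (batch, num_actions) Q-values from the expert DQN for one game.
        # amn_logits: (batch, num_actions) pre-softmax outputs of the Actor-Mimic network.
        # Target policy: softmax over the expert's Q-values at temperature tau
        # (the paper reports a temperature of 1 in all cases).
        target = softmax(expert_q / tau)
        log_pred = np.log(softmax(amn_logits) + 1e-12)
        # Cross-entropy between the expert policy and the AMN policy, averaged over the batch.
        return -(target * log_pred).sum(axis=-1).mean()

Per the quotes above, in training this loss would be computed on states drawn from the per-game replay memory and minimized with Adam; the feature regression term (scaled by β) would be added for the transfer experiments.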
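
The Experiment Setup row quotes the main hyperparameters. Collected into a single configuration sketch for readability (key names are illustrative, not taken from any released code):

    AMN_TRAINING_CONFIG = {
        "feature_regression_beta": 0.01,    # scaling on the feature regression term
        "feature_predictor": "linear",      # f_i: linear projection from AMN features to expert features
        "policy_softmax_temperature": 1.0,  # temperature for the policy regression target
        "epsilon_greedy": 0.1,              # constant epsilon during AMN training
        "replay_frames_per_game": 100_000,  # per-game AMN replay memory
        "optimizer": "Adam",                # AMNs trained with Adam
    }

    DQN_TRAINING_CONFIG = {
        "replay_frames": 1_000_000,         # full replay memory when training any DQN
        "optimizer": "RMSProp",             # expert DQNs optimized with RMSProp
    }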
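
The Open Datasets row points to the Arcade Learning Environment. Below is a minimal sketch of loading Atari games through the present-day Gymnasium/ale-py bindings; the authors interfaced with the ALE directly, and the game list here is only an illustrative subset, not the paper's evaluation set.

    import gymnasium as gym

    # Requires gymnasium[atari] and the Atari ROMs; these environment IDs use the
    # current ALE namespace, not whatever the original 2016 codebase used.
    GAMES = ["ALE/Pong-v5", "ALE/Breakout-v5", "ALE/Seaquest-v5"]
    envs = {name: gym.make(name) for name in GAMES}

    obs, info = envs["ALE/Pong-v5"].reset(seed=0)
    obs, reward, terminated, truncated, info = envs["ALE/Pong-v5"].step(0)  # action 0 = NOOP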