Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning
Authors: Emilio Parisotto, Jimmy Ba, Ruslan Salakhutdinov
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods. In the following experiments, we validate the Actor-Mimic method by demonstrating its effectiveness at both multitask and transfer learning in the Arcade Learning Environment (ALE). |
| Researcher Affiliation | Academia | Emilio Parisotto, Jimmy Ba, Ruslan Salakhutdinov Department of Computer Science University of Toronto Toronto, Ontario, Canada {eparisotto,jimmy,rsalakhu}@cs.toronto.edu |
| Pseudocode | No | The paper describes the Actor-Mimic method through textual explanations and mathematical formulations, but it does not present a dedicated pseudocode block or algorithm listing; a hedged sketch of the two training objectives is given below the table. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | Although our method can in general be applied to a wide range of problems, we use Atari games as a testing environment to demonstrate these methods. The Arcade Learning Environment (ALE) (Bellemare et al., 2013) |
| Dataset Splits | No | The paper describes training and testing epochs for evaluation and uses a replay memory, but it does not specify explicit train/validation/test dataset splits (e.g., in percentages or sample counts) for reproducibility. |
| Hardware Specification | Yes | Processing 5 million frames with the large model is equivalent to around 4 days of compute time on a NVIDIA GTX Titan. |
| Software Dependencies | No | All of our Actor-Mimic Networks (AMNs) were trained using the Adam (Kingma & Ba, 2015) optimization algorithm. For the experiments using the DQN algorithm, we optimize the networks with RMSProp. The paper names the optimization algorithms it uses (Adam, RMSProp) but does not identify any software libraries, frameworks, or version numbers. |
| Experiment Setup | Yes | For the transfer experiments with the feature regression objective, we set the scaling parameter β to 0.01 and the feature prediction network f_i was set to a linear projection from the AMN features to the i-th expert features. For the policy regression objective, we use a softmax temperature of 1 in all cases. Additionally, during training for all AMNs we use an ϵ-greedy policy with ϵ set to a constant 0.1. For AMNs we use a per-game 100,000 frame replay memory. We use the full 1,000,000 frame replay memory when training any DQN. A hedged configuration sketch based on these values appears below the table. |
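
Since the paper gives no pseudocode and no released code, the two Actor-Mimic objectives quoted above can be pinned down in a short illustrative sketch. The PyTorch-style snippet below is a reconstruction based only on the details reported in the table (softmax temperature 1 for policy regression; β = 0.01 and a linear projection for feature regression); it is not the authors' implementation, and names such as `actor_mimic_loss` and `feat_proj` are placeholders.

```python
import torch
import torch.nn.functional as F

def actor_mimic_loss(amn_logits, amn_features, expert_q, expert_features,
                     feat_proj, beta=0.01, temperature=1.0):
    """Illustrative combined Actor-Mimic objective (not the authors' code).

    amn_logits      : AMN action scores for a batch of states
    amn_features    : AMN hidden-layer features for the same states
    expert_q        : Q-values of the expert DQN for the source game
    expert_features : expert DQN hidden-layer features
    feat_proj       : linear projection from AMN features to expert features
    """
    # Policy regression: cross-entropy between the temperature-softmax of the
    # expert's Q-values (temperature 1 in the quoted setup) and the AMN policy.
    expert_policy = F.softmax(expert_q / temperature, dim=1)
    log_amn_policy = F.log_softmax(amn_logits, dim=1)
    policy_loss = -(expert_policy * log_amn_policy).sum(dim=1).mean()

    # Feature regression: squared error between the projected AMN features and
    # the expert's features, scaled by beta = 0.01 as quoted above.
    feature_loss = F.mse_loss(feat_proj(amn_features), expert_features)

    return policy_loss + beta * feature_loss
```

Per the quoted setup, the feature regression term is the one used in the transfer experiments; the policy regression term applies in all cases.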
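The training configuration reported in the table (Adam optimization, a constant ϵ-greedy policy with ϵ = 0.1, a per-game 100,000-frame replay memory for AMNs and a 1,000,000-frame memory for any DQN) can likewise be summarized in a small sketch. Only those numbers come from the paper; the network shape, game list, and input dimensions below are assumptions made so the snippet runs on its own.

```python
import random
from collections import deque

import torch

# Values quoted from the paper; everything else in this sketch is assumed.
EPSILON = 0.1                     # constant epsilon-greedy exploration
AMN_REPLAY_CAPACITY = 100_000     # per-game replay memory for the AMN
DQN_REPLAY_CAPACITY = 1_000_000   # replay memory when training any DQN

NUM_ACTIONS = 18                  # full ALE action set (assumed here)

# Placeholder network: the real AMN is a convolutional network over stacked
# Atari frames; a tiny MLP stands in so the example is self-contained.
amn = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(84 * 84 * 4, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(amn.parameters())        # Adam, as in the paper
replay = {game: deque(maxlen=AMN_REPLAY_CAPACITY)     # one memory per game
          for game in ["breakout", "pong", "seaquest"]}

def epsilon_greedy_action(state: torch.Tensor) -> int:
    """Act randomly with probability EPSILON, otherwise act greedily."""
    if random.random() < EPSILON:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        return int(amn(state.unsqueeze(0)).argmax(dim=1).item())
```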