Knowledge Transfer for Deep Reinforcement Learning with Hierarchical Experience Replay
Authors: Haiyan Yin, Sinno Jialin Pan
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use Atari 2600 games as testing environment to demonstrate the efficiency and effectiveness of our proposed solution for policy distillation. |
| Researcher Affiliation | Academia | Haiyan Yin, Sinno Jialin Pan School of Computer Science and Engineering Nanyang Technological University, Singapore {haiyanyin, sinnopan}@ntu.edu.sg |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that source code for the described methodology is publicly available. |
| Open Datasets | Yes | We use Atari 2600 games as testing environment to demonstrate the efficiency and effectiveness of our proposed solution for policy distillation. |
| Dataset Splits | No | The paper describes how experiences are generated for training and how evaluation is performed periodically, but it does not specify training/validation/test dataset splits with percentages or fixed counts, as would be expected for a static supervised-learning dataset. |
| Hardware Specification | No | The paper vaguely mentions "With modern GPUs" but does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions the use of the RMSProp algorithm for optimization, and refers to DQN, but does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The network architecture used to train the single-task teacher DQN is identical to (Mnih et al. 2015). For student network, we used the proposed architecture as shown in Figure 1, where the convolutional layers from teacher networks are used to generate task-specific input features with a dimension size of 3,136. Moreover, the student network has two fully connected layers, with each consisting of 1,028 and 512 neurons respectively, and an output layer of 18 units. Each output corresponds to one control action in Atari games. The value for ϵ linearly decays from 1 to 0.1 within first 1 million steps. The student performs one mini-batch update by sampling experience from each teacher’s replay memory at every 4 steps of playing. When using hierarchical prioritized experience replay, the number of partitions for each replay memory is set to be 5. Each partition can store up to 200,000 experiences. When using uniform sampling, the replay memory capacity is set to be 500,000. To avoid the agent from memorizing the steps, a random number of null operations (up to 30) are generated at the start of each episode. |
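
The setup row above pins down every layer size in the student network (3,136-dimensional convolutional features from the teacher, fully connected layers of 1,028 and 512 units, and 18 outputs). The sketch below shows how those dimensions fit together; the framework (PyTorch) and the ReLU activations are assumptions for illustration, since the paper does not name its software stack and releases no code.

```python
# Minimal sketch of the reported student architecture. Layer sizes come from the
# experiment-setup excerpt; the framework, activations, and class names are
# assumptions, not the authors' implementation.
import torch.nn as nn


class TeacherConvFeatures(nn.Module):
    """DQN-style conv stack (Mnih et al. 2015); its flattened output is 7*7*64 = 3,136."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )

    def forward(self, frames):  # frames: (batch, 4, 84, 84) stacked grayscale frames
        return self.conv(frames)  # -> (batch, 3136) task-specific features


class StudentNetwork(nn.Module):
    """Fully connected student layers on top of task-specific conv features."""

    def __init__(self, feature_dim=3136, n_actions=18):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feature_dim, 1028), nn.ReLU(),
            nn.Linear(1028, 512), nn.ReLU(),
            nn.Linear(512, n_actions),  # one unit per Atari control action
        )

    def forward(self, features):
        return self.fc(features)
```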
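The replay-memory capacities are likewise stated exactly: five partitions of 200,000 experiences each under hierarchical prioritized replay, or a single 500,000-capacity memory under uniform sampling. The sketch below only mirrors those capacities; how experiences are assigned to partitions and how partitions are weighted during sampling are the paper's hierarchical prioritization scheme, which the uniform placeholder choices here do not reproduce.

```python
# Sketch of a partitioned replay memory with the capacities from the setup row.
# Partition assignment and sampling weights are placeholders, not the authors' scheme.
import random
from collections import deque


class PartitionedReplayMemory:
    """Replay memory split into fixed-capacity partitions (5 x 200,000 in the paper)."""

    def __init__(self, n_partitions=5, partition_capacity=200_000):
        self.partitions = [deque(maxlen=partition_capacity) for _ in range(n_partitions)]

    def add(self, experience, partition_idx):
        # In the paper the partition index would follow from the experience's priority;
        # here the caller supplies it directly.
        self.partitions[partition_idx].append(experience)

    def sample(self, batch_size):
        # Placeholder two-level sampling: pick a non-empty partition uniformly,
        # then draw a mini-batch uniformly within it.
        non_empty = [p for p in self.partitions if len(p) >= batch_size]
        partition = random.choice(non_empty)
        return random.sample(list(partition), batch_size)
```

In the reported setup, the student draws one such mini-batch from each teacher's replay memory every 4 steps of play.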
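Finally, the exploration schedule and episode starts quoted above (ϵ annealed linearly from 1 to 0.1 over the first 1 million steps, up to 30 random null operations per episode) can be written down directly. The Gym-style `reset`/`step` interface below is an assumption; the paper does not specify its environment API.

```python
import random


def epsilon_at(step, start=1.0, end=0.1, decay_steps=1_000_000):
    """Linear epsilon annealing from 1.0 to 0.1 over the first 1M steps, then held constant."""
    if step >= decay_steps:
        return end
    return start + (end - start) * step / decay_steps


def noop_start(env, max_noops=30, noop_action=0):
    """Apply a random number of null operations (up to 30) at the start of an episode."""
    state = env.reset()
    for _ in range(random.randint(0, max_noops)):
        state, _, _, _ = env.step(noop_action)  # assumes a Gym-style 4-tuple step API
    return state
```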