REPAINT: Knowledge Transfer in Deep Reinforcement Learning
Authors: Yunzhe Tao, Sahika Genc, Jonathan Chung, Tao Sun, Sunil Mallya
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on several benchmark tasks show that REPAINT significantly reduces the total training time in generic cases of task similarity. In particular, when the source tasks are dissimilar to, or sub-tasks of, the target tasks, REPAINT outperforms other baselines in both training-time reduction and asymptotic performance of return scores. (Section 6: Experiments) |
| Researcher Affiliation | Industry | AI Labs, Amazon Web Services, Seattle, WA 98121, USA. Correspondence to: Yunzhe Tao <yunzhe.tao@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 REPAINT with Clipped PPO (a hedged loss sketch appears after this table) |
| Open Source Code | No | The paper does not explicitly state that the source code for the methodology is being released or provide a link to a code repository. |
| Open Datasets | Yes | To assess the REPAINT algorithm, we use three platforms across multiple benchmark tasks with increasing complexity for experiments, i.e., Reacher and Ant environments in MuJoCo simulator (Todorov, 2016), single-car and multi-car racings in AWS DeepRacer simulator (Balaji et al., 2019), and BuildMarines and FindAndDefeatZerglings mini-games in StarCraft II environments (Vinyals et al., 2017). |
| Dataset Splits | No | The paper describes evaluation during training ('evaluate the policy for another 20 episodes'), but does not specify distinct training/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components and environments like 'MuJoCo simulator', 'AWS DeepRacer simulator', 'StarCraft II environments', and 'Clipped PPO' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In addition, one can find in Section B the hyper-parameters we used for reproducing our results. From Appendix B: We train all policies for 2000 iterations for DeepRacer, 1000 iterations for MuJoCo, and 100 iterations for StarCraft II. For Clipped PPO, we use a learning rate of 0.0003 for both actor and critic, an entropy bonus of 0.01, a discount factor of 0.99, and a GAE λ of 0.95. The PPO clip range is 0.2. The mini-batch size is 256. For REPAINT, we set α1 = 1.0 and α2 = 0.01. The teacher policy is always run for 200 iterations for warm-up. We use the cross-entropy weights βk = max(0, 1.0 - (k/500.0)^2) for MuJoCo, and βk = max(0, 1.0 - k/5.0) for DeepRacer and StarCraft II. The advantage threshold ζ is 0.1 for MuJoCo and 0.5 for DeepRacer and StarCraft II. (A configuration sketch collecting these values appears after this table.) |
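To make the Experiment Setup row easier to reuse, the sketch below collects the reported hyper-parameters into one Python configuration and implements the two annealing schedules for the cross-entropy weight βk. The dictionary layout and the helper names (`REPAINT_CONFIG`, `beta_mujoco`, `beta_deepracer_sc2`) are our own illustration, not code released by the authors.

```python
# Hedged sketch: hyper-parameters quoted from Appendix B of the paper,
# gathered into a single place. Names and structure are illustrative only.

REPAINT_CONFIG = {
    "training_iterations": {"DeepRacer": 2000, "MuJoCo": 1000, "StarCraftII": 100},
    "clipped_ppo": {
        "learning_rate": 3e-4,   # shared by actor and critic
        "entropy_bonus": 0.01,
        "discount_gamma": 0.99,
        "gae_lambda": 0.95,
        "clip_range": 0.2,
        "minibatch_size": 256,
    },
    "repaint": {
        "alpha1": 1.0,
        "alpha2": 0.01,
        "teacher_warmup_iterations": 200,
        # advantage threshold zeta used for instance transfer
        "zeta": {"MuJoCo": 0.1, "DeepRacer": 0.5, "StarCraftII": 0.5},
    },
}


def beta_mujoco(k: int) -> float:
    """Reported MuJoCo schedule: beta_k = max(0, 1.0 - (k / 500.0)**2)."""
    return max(0.0, 1.0 - (k / 500.0) ** 2)


def beta_deepracer_sc2(k: int) -> float:
    """Reported DeepRacer / StarCraft II schedule: beta_k = max(0, 1.0 - k / 5.0)."""
    return max(0.0, 1.0 - k / 5.0)
```

Under these schedules the cross-entropy term toward the teacher vanishes after 500 iterations on MuJoCo and after 5 iterations on DeepRacer and StarCraft II, so the remainder of training is driven by the task reward alone.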
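The paper's Algorithm 1 (REPAINT with Clipped PPO) is not reproduced in this report. As a rough orientation only, the sketch below shows one plausible way the annealed cross-entropy weight βk could enter a clipped-PPO objective; the function signature and the exact loss composition are our assumptions, and the advantage-based instance transfer with threshold ζ (as well as the entropy bonus and value loss) is omitted.

```python
import torch


def kickstarted_clipped_ppo_loss(ratio, advantage, student_logits,
                                 teacher_logits, beta_k, clip_range=0.2):
    """Hedged sketch of a clipped-PPO surrogate plus an annealed
    cross-entropy term toward a teacher policy (not the paper's Algorithm 1).

    ratio          : pi_student(a|s) / pi_old(a|s) for sampled actions
    advantage      : GAE advantage estimates
    student_logits : student policy logits over discrete actions
    teacher_logits : teacher policy logits over the same actions
    beta_k         : annealed cross-entropy weight for iteration k
    """
    # Standard PPO clipped surrogate (maximized, hence the negation).
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    ppo_loss = -torch.min(unclipped, clipped).mean()

    # Cross-entropy between teacher and student action distributions.
    teacher_probs = torch.softmax(teacher_logits, dim=-1)
    student_log_probs = torch.log_softmax(student_logits, dim=-1)
    ce_loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()

    # Total loss: task surrogate plus annealed imitation term.
    return ppo_loss + beta_k * ce_loss
```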