REPAINT: Knowledge Transfer in Deep Reinforcement Learning
Authors: Yunzhe Tao, Sahika Genc, Jonathan Chung, Tao Sun, Sunil Mallya
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on several benchmark tasks show that REPAINT significantly reduces the total training time in generic cases of task similarity. In particular, when the source tasks are dissimilar to, or sub-tasks of, the target tasks, REPAINT outperforms other baselines in both training-time reduction and asymptotic performance of return scores. (Section 6: Experiments) |
| Researcher Affiliation | Industry | AI Labs, Amazon Web Services, Seattle, WA 98121, USA. Correspondence to: Yunzhe Tao <yunzhe.tao@gmail.com>. |
| Pseudocode | Yes | Algorithm 1 REPAINT with Clipped PPO (a hedged loss sketch appears after this table) |
| Open Source Code | No | The paper does not explicitly state that the source code for the methodology is being released or provide a link to a code repository. |
| Open Datasets | Yes | To assess the REPAINT algorithm, we use three platforms across multiple benchmark tasks with increasing complexity for experiments, i.e., Reacher and Ant environments in MuJoCo simulator (Todorov, 2016), single-car and multi-car racings in AWS DeepRacer simulator (Balaji et al., 2019), and BuildMarines and FindAndDefeatZerglings mini-games in StarCraft II environments (Vinyals et al., 2017). |
| Dataset Splits | No | The paper describes evaluation during training ('evaluate the policy for another 20 episodes'), but does not specify distinct training/validation/test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components and environments like 'MuJoCo simulator', 'AWS DeepRacer simulator', 'StarCraft II environments', and 'Clipped PPO' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | In addition, one can find in Section B the hyper-parameters we used for reproducing our results. From Appendix B: We train all policies for 2000 iterations for DeepRacer, 1000 iterations for MuJoCo, and 100 iterations for StarCraft II. For Clipped PPO, we use a learning rate of 0.0003 for both actor and critic, an entropy bonus of 0.01, a discount factor of 0.99, and a GAE λ of 0.95. The PPO clip range is 0.2. The mini-batch size is 256. For REPAINT, we set α1 = 1.0 and α2 = 0.01. The teacher policy is always run for 200 iterations for warm-up. We use the cross-entropy weights βk = max(0, 1.0 - (k/500.0)^2) for MuJoCo, and βk = max(0, 1.0 - k/5.0) for DeepRacer and StarCraft II. The advantage threshold ζ is 0.1 for MuJoCo and 0.5 for DeepRacer and StarCraft II. (A configuration sketch collecting these values appears after this table.) |
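To make the Experiment Setup row easier to reuse, the sketch below collects the reported hyper-parameters into one Python configuration and implements the two annealing schedules for the cross-entropy weight βk. The dictionary layout and the helper names (`REPAINT_CONFIG`, `beta_mujoco`, `beta_deepracer_sc2`) are our own illustration, not code released by the authors.

```python
# Hedged sketch: hyper-parameters quoted from Appendix B of the paper,
# gathered into a single place. Names and structure are illustrative only.

REPAINT_CONFIG = {
    "training_iterations": {"DeepRacer": 2000, "MuJoCo": 1000, "StarCraftII": 100},
    "clipped_ppo": {
        "learning_rate": 3e-4,   # shared by actor and critic
        "entropy_bonus": 0.01,
        "discount_gamma": 0.99,
        "gae_lambda": 0.95,
        "clip_range": 0.2,
        "minibatch_size": 256,
    },
    "repaint": {
        "alpha1": 1.0,
        "alpha2": 0.01,
        "teacher_warmup_iterations": 200,
        # advantage threshold zeta used for instance transfer
        "zeta": {"MuJoCo": 0.1, "DeepRacer": 0.5, "StarCraftII": 0.5},
    },
}


def beta_mujoco(k: int) -> float:
    """Reported MuJoCo schedule: beta_k = max(0, 1.0 - (k / 500.0)**2)."""
    return max(0.0, 1.0 - (k / 500.0) ** 2)


def beta_deepracer_sc2(k: int) -> float:
    """Reported DeepRacer / StarCraft II schedule: beta_k = max(0, 1.0 - k / 5.0)."""
    return max(0.0, 1.0 - k / 5.0)
```

Under these schedules the cross-entropy term toward the teacher vanishes after 500 iterations on MuJoCo and after 5 iterations on DeepRacer and StarCraft II, so the remainder of training is driven by the task reward alone.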
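The paper's Algorithm 1 (REPAINT with Clipped PPO) is not reproduced in this report. As a rough orientation only, the sketch below shows one plausible way the annealed cross-entropy weight βk could enter a clipped-PPO objective; the function signature and the exact loss composition are our assumptions, and the advantage-based instance transfer with threshold ζ (as well as the entropy bonus and value loss) is omitted.

```python
import torch


def kickstarted_clipped_ppo_loss(ratio, advantage, student_logits,
                                 teacher_logits, beta_k, clip_range=0.2):
    """Hedged sketch of a clipped-PPO surrogate plus an annealed
    cross-entropy term toward a teacher policy (not the paper's Algorithm 1).

    ratio          : pi_student(a|s) / pi_old(a|s) for sampled actions
    advantage      : GAE advantage estimates
    student_logits : student policy logits over discrete actions
    teacher_logits : teacher policy logits over the same actions
    beta_k         : annealed cross-entropy weight for iteration k
    """
    # Standard PPO clipped surrogate (maximized, hence the negation).
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    ppo_loss = -torch.min(unclipped, clipped).mean()

    # Cross-entropy between teacher and student action distributions.
    teacher_probs = torch.softmax(teacher_logits, dim=-1)
    student_log_probs = torch.log_softmax(student_logits, dim=-1)
    ce_loss = -(teacher_probs * student_log_probs).sum(dim=-1).mean()

    # Total loss: task surrogate plus annealed imitation term.
    return ppo_loss + beta_k * ce_loss
```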