Efficient Deep Reinforcement Learning via Adaptive Policy Transfer

Authors: Tianpei Yang, Jianye Hao, Zhaopeng Meng, Zongzhang Zhang, Yujing Hu, Yingfeng Chen, Changjie Fan, Weixun Wang, Wulong Liu, Zhaodong Wang, Jiajie Peng

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | experimental results show it significantly accelerates RL and surpasses state-of-the-art policy transfer methods in both discrete and continuous action spaces.
Researcher Affiliation | Collaboration | 1College of Intelligence and Computing, Tianjin University; 2Noah's Ark Lab, Huawei; 3Tianjin Key Lab of Machine Learning; 4Nanjing University; 5Fuxi AI Lab in Netease; 6JD Digits; {tpyang,jianye.hao,mengzp}@tju.edu.cn, zzzhang@nju.edu.cn, {huyujing,chenyingfeng1,fanchangjie}@corp.netease.com, wxwang@tju.edu.cn, liuwulong@huawei.com, zhaodong.wang@jd.com, jiajiep@gmail.com
Pseudocode | Yes | Algorithm 1: PTF-A3C
Open Source Code | Yes | The source code and supplementary materials are put on https://github.com/PTF-transfer/Code_PTF.
Open Datasets | Yes | In this section, we evaluate PTF on three domains, grid world [Li et al., 2019], pinball [Bacon et al., 2017] and reacher [Tassa et al., 2018], compared with several DRL methods learning from scratch (A3C [Mnih et al., 2016] and PPO [Schulman et al., 2017]); and the state-of-the-art policy transfer method CAPS [Li et al., 2019], implemented as a deep version (Deep-CAPS).
Dataset Splits | No | The paper mentions 'Results are averaged over 20 random seeds' but does not specify explicit train/validation/test dataset splits by percentage or sample count.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper mentions DRL methods like A3C and PPO, and environments like MuJoCo, but does not provide specific version numbers for any software dependencies or libraries used for implementation.
Experiment Setup | Yes | We set ξ = 0.005, f(t) = 0.5 + tanh(3 - 0.001t)/2. ... The episode terminates with a +10000 reward when the agent reaches the target. We interrupt any episode taking more than 500 steps and set the discount factor to 0.99. (A Python sketch of this schedule follows the table.)
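
The quoted setup fixes ξ = 0.005 and a decaying weight f(t) that controls how strongly the selected source policy guides learning in Algorithm 1 (PTF-A3C). Below is a minimal Python sketch of that schedule, assuming the reconstructed form f(t) = 0.5 + tanh(3 - 0.001t)/2; how ξ and f(t) scale an imitation term on top of the base A3C loss is an illustrative assumption, and the names `transfer_weight`, `ptf_style_loss`, and `XI` are hypothetical rather than taken from the released code.

```python
import math

# Sketch of the adaptive transfer weighting quoted in the experiment setup.
# The schedule f(t) and the coefficient xi = 0.005 come from the paper's
# quoted setup; how they combine with the base A3C loss below is an
# ASSUMPTION for illustration (see Algorithm 1, PTF-A3C, for the exact form).

XI = 0.005  # coefficient quoted in the experiment setup


def transfer_weight(t: int) -> float:
    """f(t) = 0.5 + tanh(3 - 0.001 t) / 2: close to 1 early in training,
    decaying smoothly toward 0 so the transfer term fades out over time."""
    return 0.5 + math.tanh(3.0 - 0.001 * t) / 2.0


def ptf_style_loss(base_a3c_loss: float, imitation_loss: float, t: int) -> float:
    """Hypothetical combination: base A3C loss plus a decaying imitation term
    that pulls the learned policy toward the currently selected source policy."""
    return base_a3c_loss + XI * transfer_weight(t) * imitation_loss


if __name__ == "__main__":
    # The schedule starts near 1 and reaches its midpoint at t = 3000 steps.
    for t in (0, 1000, 3000, 5000, 10000):
        print(f"t={t:>6d}  f(t)={transfer_weight(t):.3f}")
    # t=     0  f(t)=0.998
    # t=  1000  f(t)=0.982
    # t=  3000  f(t)=0.500
    # t=  5000  f(t)=0.018
    # t= 10000  f(t)=0.000
```

Under this reconstruction the weight stays near 1 for roughly the first thousand updates, crosses 0.5 at t = 3000, and then vanishes, so the agent leans on the transferred source policy early in training and on its own experience afterwards.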