Efficient Deep Reinforcement Learning via Adaptive Policy Transfer

Authors: Tianpei Yang, Jianye Hao, Zhaopeng Meng, Zongzhang Zhang, Yujing Hu, Yingfeng Chen, Changjie Fan, Weixun Wang, Wulong Liu, Zhaodong Wang, Jiajie Peng

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | experimental results show it significantly accelerates RL and surpasses state-of-the-art policy transfer methods in both discrete and continuous action spaces.
Researcher Affiliation | Collaboration | 1College of Intelligence and Computing, Tianjin University; 2Noah's Ark Lab, Huawei; 3Tianjin Key Lab of Machine Learning; 4Nanjing University; 5Fuxi AI Lab in Netease; 6JD Digits; {tpyang,jianye.hao,mengzp}@tju.edu.cn, zzzhang@nju.edu.cn, {huyujing,chenyingfeng1,fanchangjie}@corp.netease.com, wxwang@tju.edu.cn, liuwulong@huawei.com, zhaodong.wang@jd.com, jiajiep@gmail.com
Pseudocode | Yes | Algorithm 1: PTF-A3C
Open Source Code | Yes | The source code and supplementary materials are put on https://github.com/PTF-transfer/Code_PTF.
Open Datasets | Yes | In this section, we evaluate PTF on three domains, grid world [Li et al., 2019], pinball [Bacon et al., 2017] and reacher [Tassa et al., 2018], compared with several DRL methods learning from scratch (A3C [Mnih et al., 2016] and PPO [Schulman et al., 2017]); and the state-of-the-art policy transfer method CAPS [Li et al., 2019], implemented as a deep version (Deep-CAPS).
Dataset Splits | No | The paper mentions 'Results are averaged over 20 random seeds' but does not specify explicit train/validation/test dataset splits by percentage or sample count.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper mentions DRL methods like A3C and PPO, and environments like MuJoCo, but does not provide specific version numbers for any software dependencies or libraries used for implementation.
Experiment Setup | Yes | We set ξ = 0.005, f(t) = 0.5 + tanh(3 - 0.001t)/2. ... The episode terminates with a +10000 reward when the agent reaches the target. We interrupt any episode taking more than 500 steps and set the discount factor to 0.99. (A Python sketch of this schedule follows the table.)
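
The quoted setup fixes ξ = 0.005 and a decaying weight f(t) that controls how strongly the selected source policy guides learning in Algorithm 1 (PTF-A3C). Below is a minimal Python sketch of that schedule, assuming the reconstructed form f(t) = 0.5 + tanh(3 - 0.001t)/2; how ξ and f(t) scale an imitation term on top of the base A3C loss is an illustrative assumption, and the names `transfer_weight`, `ptf_style_loss`, and `XI` are hypothetical rather than taken from the released code.

```python
import math

# Sketch of the adaptive transfer weighting quoted in the experiment setup.
# The schedule f(t) and the coefficient xi = 0.005 come from the paper's
# quoted setup; how they combine with the base A3C loss below is an
# ASSUMPTION for illustration (see Algorithm 1, PTF-A3C, for the exact form).

XI = 0.005  # coefficient quoted in the experiment setup


def transfer_weight(t: int) -> float:
    """f(t) = 0.5 + tanh(3 - 0.001 t) / 2: close to 1 early in training,
    decaying smoothly toward 0 so the transfer term fades out over time."""
    return 0.5 + math.tanh(3.0 - 0.001 * t) / 2.0


def ptf_style_loss(base_a3c_loss: float, imitation_loss: float, t: int) -> float:
    """Hypothetical combination: base A3C loss plus a decaying imitation term
    that pulls the learned policy toward the currently selected source policy."""
    return base_a3c_loss + XI * transfer_weight(t) * imitation_loss


if __name__ == "__main__":
    # The schedule starts near 1 and reaches its midpoint at t = 3000 steps.
    for t in (0, 1000, 3000, 5000, 10000):
        print(f"t={t:>6d}  f(t)={transfer_weight(t):.3f}")
    # t=     0  f(t)=0.998
    # t=  1000  f(t)=0.982
    # t=  3000  f(t)=0.500
    # t=  5000  f(t)=0.018
    # t= 10000  f(t)=0.000
```

Under this reconstruction the weight stays near 1 for roughly the first thousand updates, crosses 0.5 at t = 3000, and then vanishes, so the agent leans on the transferred source policy early in training and on its own experience afterwards.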