Efficient Deep Reinforcement Learning via Adaptive Policy Transfer
Authors: Tianpei Yang, Jianye Hao, Zhaopeng Meng, Zongzhang Zhang, Yujing Hu, Yingfeng Chen, Changjie Fan, Weixun Wang, Wulong Liu, Zhaodong Wang, Jiajie Peng
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show it significantly accelerates RL and surpasses state-of-the-art policy transfer methods in both discrete and continuous action spaces. |
| Researcher Affiliation | Collaboration | 1 College of Intelligence and Computing, Tianjin University; 2 Noah's Ark Lab, Huawei; 3 Tianjin Key Lab of Machine Learning; 4 Nanjing University; 5 Fuxi AI Lab in Netease; 6 JD Digits. {tpyang,jianye.hao,mengzp}@tju.edu.cn, zzzhang@nju.edu.cn, {huyujing,chenyingfeng1,fanchangjie}@corp.netease.com, wxwang@tju.edu.cn, liuwulong@huawei.com, zhaodong.wang@jd.com, jiajiep@gmail.com |
| Pseudocode | Yes | Algorithm 1 PTF-A3C |
| Open Source Code | Yes | The source code and supplementary materials are put on https://github.com/PTF-transfer/Code_PTF. |
| Open Datasets | Yes | In this section, we evaluate PTF on three domains, grid world [Li et al., 2019], pinball [Bacon et al., 2017] and reacher [Tassa et al., 2018] compared with several DRL methods learning from scratch (A3C [Mnih et al., 2016] and PPO [Schulman et al., 2017]); and the state-of-the-art policy transfer method CAPS [Li et al., 2019], implemented as a deep version (Deep-CAPS). |
| Dataset Splits | No | The paper mentions 'Results are averaged over 20 random seeds' but does not specify explicit train/validation/test dataset splits by percentage or sample count. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper mentions DRL methods like A3C and PPO, and environments like MuJoCo, but does not provide specific version numbers for any software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | We set ξ = 0.005 and f(t) = 0.5 + tanh(3 − 0.001t)/2. ... The episode terminates with a +10000 reward when the agent reaches the target. We interrupt any episode taking more than 500 steps and set the discount factor to 0.99. (A sketch of these settings follows the table.) |
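
The f(t) expression in the Experiment Setup row is reconstructed from garbled extraction, so the Python sketch below should be read under that assumption: it only restates the reported hyperparameters and the annealing schedule f(t) = 0.5 + tanh(3 − 0.001t)/2. The constant and function names (`XI`, `transfer_weight`, etc.) are illustrative and are not taken from the released Code_PTF repository.

```python
import math

# Hedged sketch of the reported PTF experiment settings.
# Names are illustrative, not from the authors' code.
XI = 0.005                 # ξ: reported weighting coefficient
GAMMA = 0.99               # discount factor
MAX_EPISODE_STEPS = 500    # episodes are interrupted after 500 steps
TARGET_REWARD = 10_000     # terminal reward on reaching the target


def transfer_weight(t: int) -> float:
    """Annealing schedule f(t) = 0.5 + tanh(3 - 0.001*t) / 2.

    Under this reading, f starts near 1 and decays toward 0 as the
    training step t grows (crossing 0.5 around t = 3000).
    """
    return 0.5 + math.tanh(3 - 0.001 * t) / 2


if __name__ == "__main__":
    for t in (0, 1000, 3000, 5000, 10000):
        print(f"f({t}) = {transfer_weight(t):.3f}")
```

If this reconstruction is correct, the weight starts near 1 and falls to near 0 after a few thousand steps, which is the usual role of such a schedule: gradually reducing reliance on the transferred source policies as training proceeds.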