Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Efficient Deep Reinforcement Learning via Adaptive Policy Transfer
Authors: Tianpei Yang, Jianye Hao, Zhaopeng Meng, Zongzhang Zhang, Yujing Hu, Yingfeng Chen, Changjie Fan, Weixun Wang, Wulong Liu, Zhaodong Wang, Jiajie Peng
IJCAI 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | experimental results show it significantly accelerates RL and surpasses state-of-the-art policy transfer methods in both discrete and continuous action spaces. |
| Researcher Affiliation | Collaboration | 1 College of Intelligence and Computing, Tianjin University; 2 Noah's Ark Lab, Huawei; 3 Tianjin Key Lab of Machine Learning; 4 Nanjing University; 5 Fuxi AI Lab in Netease; 6 JD Digits. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 PTF-A3C |
| Open Source Code | Yes | The source code and supplementary materials are put on https://github.com/PTF-transfer/Code_PTF. |
| Open Datasets | Yes | In this section, we evaluate PTF on three domains, grid world [Li et al., 2019], pinball [Bacon et al., 2017] and reacher [Tassa et al., 2018] compared with several DRL methods learning from scratch (A3C [Mnih et al., 2016] and PPO [Schulman et al., 2017]); and the state-of-the-art policy transfer method CAPS [Li et al., 2019], implemented as a deep version (Deep-CAPS). |
| Dataset Splits | No | The paper mentions 'Results are averaged over 20 random seeds' but does not specify explicit train/validation/test dataset splits by percentage or sample count. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper mentions DRL methods like A3C and PPO, and environments like MuJoCo, but does not provide specific version numbers for any software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | We set ξ = 0.005, f(t) = 0.5 + tanh(3 − 0.001·t)/2. ... The episode terminates with a +10000 reward when the agent reaches the target. We interrupt any episode taking more than 500 steps and set the discount factor to 0.99. |
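
The schedule f(t) quoted in the Experiment Setup row can be evaluated numerically to see how it behaves. The sketch below assumes the expression reads f(t) = 0.5 + tanh(3 − 0.001·t)/2; the minus sign inside tanh, the parameter name `mu`, and the reading of t as a non-negative step counter are assumptions made for illustration, not details confirmed by this page.

```python
import math


def f(t: float, mu: float = 0.001) -> float:
    """Assumed schedule f(t) = 0.5 + tanh(3 - mu * t) / 2.

    `mu` is a hypothetical name for the 0.001 coefficient in the quoted setup.
    """
    return 0.5 + math.tanh(3.0 - mu * t) / 2.0


if __name__ == "__main__":
    # Print the schedule at a few illustrative values of t.
    for step in (0, 1000, 3000, 5000, 10000):
        print(f"t={step:>6}  f(t)={f(step):.6f}")
```

Under this reading, f(t) starts close to 1, passes through 0.5 where mu * t equals 3, and decays toward 0 for large t.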