Knowledge Transfer in Multi-Task Deep Reinforcement Learning for Continuous Control

Authors: Zhiyuan Xu, Kun Wu, Zhengping Che, Jian Tang, Jieping Ye

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform a comprehensive empirical study with two commonly-used benchmarks in the MuJoCo continuous control task suite. The experimental results well justify the effectiveness of KTM-DRL and its knowledge transfer and online learning algorithms, as well as its superiority over the state-of-the-art by a large margin.
Researcher Affiliation | Collaboration | Department of Electrical Engineering & Computer Science, Syracuse University; DiDi AI Labs, Didi Chuxing
Pseudocode | Yes | Algorithm 1: KTM-DRL
Open Source Code | No | The paper does not contain an unambiguous statement of code release or a direct link to a source-code repository for the methodology described.
Open Datasets | Yes | We conducted extensive experiments with the continuous control tasks in the MuJoCo suite [19]. We employed two typical benchmarks (which will be called Benchmarks A and B in the following): 1) Benchmark A: it is called the HalfCheetah task group [7], which includes 8 similar tasks; 2) Benchmark B: it consists of 6 considerably different tasks. (An illustrative environment-setup sketch follows the table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) for training, validation, and testing. It states training epochs and evaluation trials, but not data splits.
Hardware Specification | Yes | More specifically, it takes about 12 hours for KTM-DRL to finish the 1M end-to-end training with an NVIDIA Tesla P100 GPU on Benchmark A.
Software Dependencies | No | The paper mentions software such as TD3, DQN, DDPG, SAC, and MuJoCo, but does not provide specific version numbers for these components, which are necessary for reproducible software dependencies.
Experiment Setup | Yes | In our implementation, the key hyper-parameters were set as follows: α = β = 1, the exploration noise is N(0, 0.1), the clip threshold c = 0.5, and the policy update frequency d = 2. The size of each replay buffer and of a mini-batch is set to 10^6 and 256, respectively. Reward discount factor γ is set to 0.99, target network update rate τ is set to 0.005, and the learning rate for both actor and critic networks is set to 3 × 10^-4. In KTM-DRL, every DNN consists of only 2 hidden layers with 400 ReLU-activated neurons each. (A network and hyper-parameter sketch follows the table.)
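
Environment setup. As noted in the "Open Datasets" row, the paper evaluates on two MuJoCo benchmarks: Benchmark A (the HalfCheetah task group of 8 similar tasks [7]) and Benchmark B (6 considerably different tasks). The snippet below is a minimal sketch of loading a set of MuJoCo continuous control tasks; the task IDs and the classic Gym + mujoco-py API are assumptions and do not necessarily match the paper's exact benchmark composition.

# Minimal multi-task environment setup sketch (illustrative only; task IDs are
# placeholders, not necessarily the paper's Benchmark A/B composition).
# Assumes classic gym (<0.26) with mujoco-py, where env.reset() returns obs only.
import gym

task_ids = ["HalfCheetah-v2", "Ant-v2", "Hopper-v2",
            "Walker2d-v2", "Swimmer-v2", "Humanoid-v2"]

envs = [gym.make(tid) for tid in task_ids]
for tid, env in zip(task_ids, envs):
    obs = env.reset()
    # Each task has its own state/action dimensionality, which a multi-task
    # agent must handle (e.g., via task-specific input/output heads).
    print(tid, "obs dim:", env.observation_space.shape[0],
          "act dim:", env.action_space.shape[0])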
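
Network and hyper-parameter sketch. The "Experiment Setup" row fixes the network shape (2 hidden layers of 400 ReLU units per DNN) and the TD3-style hyper-parameters. The PyTorch sketch below reflects those quoted values; it is an assumption of how such networks could be written, not the authors' implementation, and the state/action dimensions, max_action handling, and choice of Adam are assumptions.

import torch
import torch.nn as nn

# Hyper-parameters quoted in the paper (transfer-loss weights and TD3-style settings).
HP = dict(alpha=1.0, beta=1.0,          # alpha = beta = 1
          expl_noise_std=0.1,           # exploration noise N(0, 0.1)
          noise_clip=0.5,               # clip threshold c
          policy_freq=2,                # policy update frequency d
          buffer_size=int(1e6), batch_size=256,
          gamma=0.99, tau=0.005, lr=3e-4)

class Actor(nn.Module):
    """Deterministic policy: state -> action, 2 hidden layers of 400 ReLU units."""
    def __init__(self, state_dim, action_dim, max_action=1.0, hidden=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q-function: (state, action) -> value, 2 hidden layers of 400 ReLU units."""
    def __init__(self, state_dim, action_dim, hidden=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Example instantiation for a hypothetical task with 17-dim states and 6-dim actions.
actor, critic = Actor(17, 6), Critic(17, 6)
actor_opt = torch.optim.Adam(actor.parameters(), lr=HP["lr"])
critic_opt = torch.optim.Adam(critic.parameters(), lr=HP["lr"])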