Dual Policy Distillation
Authors: Kwei-Herng Lai, Daochen Zha, Yuening Li, Xia Hu
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The conducted experiments on several continuous control tasks show that the proposed framework achieves superior performance with a learning-based agent and function approximation without the use of expensive teacher models. |
| Researcher Affiliation | Academia | Kwei-Herng Lai, Daochen Zha, Yuening Li and Xia Hu, Department of Computer Science and Engineering, Texas A&M University, {khlai037, daochen.zha, yueningl, xiahu}@tamu.edu |
| Pseudocode | Yes | Algorithm 1 DPD: dual policy distillation |
| Open Source Code | Yes | We propose a practical algorithm¹ based on our theoretical results. The algorithm uses a disadvantageous policy distillation strategy (...) ¹ https://github.com/datamllab/dual-policy-distillation (an illustrative sketch of this distillation step appears below the table) |
| Open Datasets | Yes | The experiments are conducted on several continuous control tasks from OpenAI Gym [Brockman et al., 2016]: Swimmer-v2, HalfCheetah-v2, Walker2d-v2, Humanoid-v2 (a minimal snippet instantiating these environments appears below the table). |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits. For continuous control tasks in RL, the concept of a static dataset split for validation is often replaced by ongoing evaluation during training or separate test episodes, but no explicit 'validation set' is mentioned. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments, such as specific CPU or GPU models. |
| Software Dependencies | No | The paper mentions that experiments are implemented upon "PPO [Schulman et al., 2017] and DDPG [Lillicrap et al., 2016], which are benchmark RL algorithms implemented in Open AI baselines" and links to the OpenAI Baselines GitHub. However, it does not specify version numbers for any software components, libraries, or frameworks to ensure reproducibility. |
| Experiment Setup | No | The paper states: "We follow all the hyper-parameters setting and network structures for our DPD implementation and all the baselines we considered." While this indicates that existing hyper-parameter settings and network structures were reused, the paper does not provide their concrete values (e.g., learning rate, batch size, network architectures) in the main text. |
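For context on the Pseudocode and Open Source Code rows, below is a minimal PyTorch sketch of the disadvantageous-distillation idea, not the authors' implementation (which builds on PPO/DDPG from OpenAI Baselines, see the linked repository). The network sizes, the `disadvantageous_distill_loss` name, the critic-comparison weighting, and the MSE distillation term are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 17, 6  # Walker2d-v2 sizes, used here purely for illustration

def make_policy():
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(),
                         nn.Linear(64, ACT_DIM), nn.Tanh())

def make_critic():
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(),
                         nn.Linear(64, 1))

# Two peer learners, each with its own policy and value critic.
policy_a, policy_b = make_policy(), make_policy()
critic_a, critic_b = make_critic(), make_critic()
opt_a = torch.optim.Adam(policy_a.parameters(), lr=3e-4)

def disadvantageous_distill_loss(policy, peer_policy, critic, peer_critic, states):
    """Distill from the peer only on states where the peer's critic assigns a
    higher value than our own critic (a rough stand-in for the paper's
    disadvantageous-state weighting)."""
    with torch.no_grad():
        peer_actions = peer_policy(states)
        disadvantageous = (peer_critic(states) > critic(states)).float()
    gap = ((policy(states) - peer_actions) ** 2).mean(dim=-1, keepdim=True)
    return (disadvantageous * gap).mean()

# One illustrative update for peer A; in practice this term would be combined
# with the usual PPO/DDPG objective and mirrored for peer B.
states = torch.randn(32, OBS_DIM)  # placeholder batch; real states come from rollouts
loss = disadvantageous_distill_loss(policy_a, policy_b, critic_a, critic_b, states)
opt_a.zero_grad()
loss.backward()
opt_a.step()
```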
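As a companion to the Open Datasets row, this snippet shows how the listed environments could be instantiated with classic OpenAI Gym. It assumes a gym installation with MuJoCo support (mujoco-py); newer Gym/Gymnasium releases use different environment IDs and a slightly different reset API.

```python
import gym  # the MuJoCo tasks below additionally require mujoco-py

# The four continuous-control tasks listed in the paper.
ENV_IDS = ["Swimmer-v2", "HalfCheetah-v2", "Walker2d-v2", "Humanoid-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    print(env_id, env.observation_space.shape, env.action_space.shape)
    env.close()
```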