Dual Policy Distillation

Authors: Kwei-Herng Lai, Daochen Zha, Yuening Li, Xia Hu

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The conducted experiments on several continuous control tasks show that the proposed framework achieves superior performance with a learning-based agent and function approximation, without the use of expensive teacher models.
Researcher Affiliation | Academia | Kwei-Herng Lai, Daochen Zha, Yuening Li and Xia Hu, Department of Computer Science and Engineering, Texas A&M University. {khlai037, daochen.zha, yueningl, xiahu}@tamu.edu
Pseudocode | Yes | Algorithm 1 DPD: dual policy distillation. (A hedged sketch of the distillation step appears below the table.)
Open Source Code | Yes | "We propose a practical algorithm based on our theoretical results. The algorithm uses a disadvantageous policy distillation strategy (...)" Code: https://github.com/datamllab/dual-policy-distillation
Open Datasets | Yes | The experiments are conducted on several continuous control tasks from OpenAI Gym [Brockman et al., 2016]: Swimmer-v2, HalfCheetah-v2, Walker2d-v2, Humanoid-v2. (A Gym setup snippet appears below the table.)
Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits. For continuous control tasks in RL, a static dataset split for validation is typically replaced by ongoing evaluation during training or separate test episodes, but no explicit validation set is mentioned.
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments, such as specific CPU or GPU models.
Software Dependencies | No | The paper mentions that the experiments are implemented upon PPO [Schulman et al., 2017] and DDPG [Lillicrap et al., 2016], which are benchmark RL algorithms implemented in OpenAI Baselines, and links to the OpenAI Baselines GitHub repository. However, it does not specify version numbers for any software components, libraries, or frameworks, which limits reproducibility.
Experiment Setup | No | The paper states: "We follow all the hyper-parameters setting and network structures for our DPD implementation and all the baselines we considered." While this implies the default settings were reused, it does not provide the concrete values (e.g., learning rate, batch size, network architectures) in the main text. (The Baselines MuJoCo defaults for PPO are reproduced below the table for reference.)
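
To make the Pseudocode and Open Source Code rows more concrete, the following is a minimal sketch of the disadvantageous distillation step as we read it from the paper's description: each of the two learners imitates its peer only on states where the peer's value estimate exceeds its own. The network classes, the masking rule, and the MSE imitation term are our assumptions for illustration, not the authors' implementation; the linked repository contains the actual code.

```python
import torch
import torch.nn as nn

# Hypothetical networks for illustration; the paper reuses the Baselines architectures.
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, act_dim), nn.Tanh())
    def forward(self, obs):
        return self.net(obs)

class ValueNet(nn.Module):
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def disadvantageous_distillation_loss(policy, value, peer_policy, peer_value, states):
    """Distill from the peer only on states where the peer appears advantaged.

    Sketch under assumptions: the peer's value estimate is compared against our own,
    and an MSE imitation term is masked to those "disadvantageous" states.
    """
    with torch.no_grad():
        peer_actions = peer_policy(states)
        # 1 where the peer's estimated value exceeds ours, 0 elsewhere (assumed weighting).
        mask = (peer_value(states) > value(states)).float()
    imitation = ((policy(states) - peer_actions) ** 2).mean(dim=-1)
    return (mask * imitation).mean()

# Usage on a random batch (stand-in for states sampled during training).
obs_dim, act_dim = 17, 6
p1, v1 = PolicyNet(obs_dim, act_dim), ValueNet(obs_dim)
p2, v2 = PolicyNet(obs_dim, act_dim), ValueNet(obs_dim)
states = torch.randn(32, obs_dim)
loss_1 = disadvantageous_distillation_loss(p1, v1, p2, v2, states)  # added to learner 1's loss
loss_2 = disadvantageous_distillation_loss(p2, v2, p1, v1, states)  # added to learner 2's loss
loss_1.backward()
```

In the full framework, each learner alternates a distillation update of this kind with its own reinforcement learning update (PPO or DDPG in the paper's experiments).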
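The Open Datasets row refers to standard Gym environments rather than static datasets. The snippet below is a generic setup example, not taken from the paper; it assumes the older Gym reset/step API of the -v2 era and a working mujoco-py installation, since all four tasks are MuJoCo-based.

```python
import gym  # requires mujoco-py for the -v2 MuJoCo tasks

ENV_IDS = ["Swimmer-v2", "HalfCheetah-v2", "Walker2d-v2", "Humanoid-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        # Random policy stand-in; a trained DPD/PPO/DDPG policy would act here.
        obs, reward, done, info = env.step(env.action_space.sample())
        total_reward += reward
    print(f"{env_id}: episode return with random actions = {total_reward:.1f}")
    env.close()
```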
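Because the Experiment Setup row notes that the paper defers to the default settings of the Baselines implementations, the MuJoCo defaults shipped with OpenAI Baselines' ppo2 are reproduced below for reference. These values come from the Baselines repository (baselines/ppo2/defaults.py) as we recall it, not from the paper, and may differ across Baselines versions.

```python
# Approximate MuJoCo defaults from OpenAI Baselines' ppo2 (baselines/ppo2/defaults.py);
# reproduced for reference only — the paper itself does not list these values.
PPO2_MUJOCO_DEFAULTS = dict(
    nsteps=2048,            # rollout length per environment
    nminibatches=32,        # minibatches per optimization epoch
    lam=0.95,               # GAE lambda
    gamma=0.99,             # discount factor
    noptepochs=10,          # optimization epochs per update
    ent_coef=0.0,           # entropy bonus coefficient
    lr=lambda f: 3e-4 * f,  # learning rate, annealed with remaining progress fraction f
    cliprange=0.2,          # PPO clipping parameter
    value_network='copy',   # separate value-function network
)
```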