Prompting to Distill: Boosting Data-Free Knowledge Distillation via Reinforced Prompt

Authors: Xinyin Ma, Xinchao Wang, Gongfan Fang, Yongliang Shen, Weiming Lu

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance. We conduct experiments on four text classification datasets to validate the efficacy of PromptDFD: AG News [Zhang et al., 2015], DBPedia [Auer et al., 2007], IMDb [Maas et al., 2011] and SST-2 [Socher et al., 2013]. Table 3 and Table 4 compare the accuracy and the teacher-student agreement of different data-free distillation algorithms.
Researcher Affiliation | Academia | Xinyin Ma¹, Xinchao Wang², Gongfan Fang¹, Yongliang Shen¹ and Weiming Lu¹. ¹College of Computer Science and Technology, Zhejiang University; ²Department of Electrical and Computer Engineering, National University of Singapore.
Pseudocode | Yes | Algorithm 1: The training procedure of PromptDFD
Open Source Code | No | The paper references 'DistilGPT2' with a link to huggingface.co/distilgpt2, which is a third-party open-source model used by the authors. However, there is no explicit statement or link indicating that the authors have released the source code for their own proposed method (PromptDFD).
Open Datasets | Yes | We conduct experiments on four text classification datasets to validate the efficacy of PromptDFD: AG News [Zhang et al., 2015], DBPedia [Auer et al., 2007], IMDb [Maas et al., 2011] and SST-2 [Socher et al., 2013]. (A dataset-loading sketch follows the table below.)
Dataset Splits | Yes | Table 4: Performance on the dev set of SST-2. The epoch for training is set to 10, the temperature of KD is selected from {1, 5, 10}, and α is selected from {0.5, 0.9}. While exact percentages are not stated, the mention of a 'dev set' for SST-2 implies the standard validation split commonly used for these benchmarks. (See the distillation-loss sketch below the table for how the temperature and α typically enter the objective.)
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions models such as 'BERT-base', 'GPT2', and 'DistilGPT2', but it does not specify version numbers for these or for any other software dependencies.
Experiment Setup | Yes | A grid search on parameters is performed in our experiment, where the learning rate of the student is selected from {1e-5, 2e-5, 5e-5, 1e-4}, the learning rate of the Gp is selected from {5e-6, 1e-5, 2e-5}, the batch size is selected from {128, 256}, the epoch for training is set to 10, the temperature of KD is selected from {1, 5, 10}, and α is selected from {0.5, 0.9}. Adam is used to optimize the student network and the topic prompter. The top 50 tokens with the highest probability are selected for decoding and the threshold for top-p is set to 0.95.
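
All four benchmarks quoted under Open Datasets are publicly available. Below is a minimal loading sketch using the HuggingFace datasets library; the hub identifiers (ag_news, dbpedia_14, imdb, glue/sst2) are this report's assumptions, since the paper does not state how the corpora were obtained.

    # Minimal sketch: load the four text classification benchmarks named in the paper.
    # The hub identifiers below are assumptions of this report, not taken from the paper.
    from datasets import load_dataset

    ag_news = load_dataset("ag_news")        # [Zhang et al., 2015]
    dbpedia = load_dataset("dbpedia_14")     # [Auer et al., 2007]
    imdb    = load_dataset("imdb")           # [Maas et al., 2011]
    sst2    = load_dataset("glue", "sst2")   # [Socher et al., 2013]

    # Print the available splits and their sizes for each benchmark.
    for name, ds in [("AG News", ag_news), ("DBPedia", dbpedia),
                     ("IMDb", imdb), ("SST-2", sst2)]:
        print(name, {split: len(ds[split]) for split in ds})

The printout also makes the available splits explicit, which is relevant to the Dataset Splits row: SST-2 as distributed with GLUE ships with train, validation, and test splits, and its 'validation' split is the dev set referenced in Table 4.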
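
The Dataset Splits and Experiment Setup rows both quote a KD temperature chosen from {1, 5, 10} and a weight α chosen from {0.5, 0.9} without reproducing the objective. The sketch below shows the standard soft-target distillation loss in which these two hyperparameters usually appear; the exact formulation used by PromptDFD, and in particular which term α weights, is an assumption here rather than something stated in the quoted text.

    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=5.0, alpha=0.9):
        # Soft-target term: KL divergence between temperature-scaled
        # distributions, rescaled by T^2 to keep gradient magnitudes comparable.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-target term: ordinary cross-entropy against the labels
        # (for synthetic data these would be teacher-assigned labels).
        hard = F.cross_entropy(student_logits, labels)
        # Assumed convention: alpha weights the soft (KD) term.
        return alpha * soft + (1.0 - alpha) * hard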
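
For the Experiment Setup row, the quoted grid and decoding settings can be made concrete as follows. This is a sketch only: it enumerates the reported hyperparameter grid and samples from the off-the-shelf DistilGPT2 checkpoint with top-k = 50 and top-p = 0.95, as quoted; the topic prompt text is a hypothetical placeholder, and none of this is the authors' released code (no code release accompanies the paper).

    import itertools
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hyperparameter grid quoted from the paper's setup (Adam is used for both
    # the student and the topic prompter; each run is trained for 10 epochs).
    grid = {
        "student_lr":     [1e-5, 2e-5, 5e-5, 1e-4],
        "prompter_lr":    [5e-6, 1e-5, 2e-5],
        "batch_size":     [128, 256],
        "kd_temperature": [1, 5, 10],
        "alpha":          [0.5, 0.9],
    }
    configs = [dict(zip(grid, v)) for v in itertools.product(*grid.values())]
    print(len(configs), "configurations in the grid search")

    # Synthesis step: top-50 / top-p 0.95 sampling from DistilGPT2, matching the
    # quoted decoding settings. The prompt string is a hypothetical placeholder.
    tok = AutoTokenizer.from_pretrained("distilgpt2")
    lm = AutoModelForCausalLM.from_pretrained("distilgpt2")

    inputs = tok("World news:", return_tensors="pt")
    with torch.no_grad():
        out = lm.generate(
            **inputs,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            max_new_tokens=64,
            pad_token_id=tok.eos_token_id,
        )
    print(tok.decode(out[0], skip_special_tokens=True))

Each entry of configs would then drive one 10-epoch distillation run, per the quoted setup.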