Prompting to Distill: Boosting Data-Free Knowledge Distillation via Reinforced Prompt
Authors: Xinyin Ma, Xinchao Wang, Gongfan Fang, Yongliang Shen, Weiming Lu
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance. We conduct experiments on four text classification datasets to validate the efficacy of Prompt DFD: AG News [Zhang et al., 2015], DBPedia [Auer et al., 2007], IMDb [Maas et al., 2011] and SST-2 [Socher et al., 2013]. Table 3 and Table 4 compare the accuracy and the teacher-student agreement of different data-free distillation algorithms. |
| Researcher Affiliation | Academia | Xinyin Ma¹, Xinchao Wang², Gongfan Fang¹, Yongliang Shen¹ and Weiming Lu¹ · ¹College of Computer Science and Technology, Zhejiang University · ²Department of Electrical and Computer Engineering, National University of Singapore |
| Pseudocode | Yes | Algorithm 1: The training procedure of Prompt DFD |
| Open Source Code | No | The paper references 'DistilGPT2' with a link to huggingface.co/distilgpt2, a third-party open-source model used by the authors. However, there is no statement or link indicating that the authors have released the source code for their proposed method (Prompt DFD). |
| Open Datasets | Yes | We conduct experiments on four text classification datasets to validate the efficacy of Prompt DFD: AG News [Zhang et al., 2015], DBPedia [Auer et al., 2007], IMDb [Maas et al., 2011] and SST-2 [Socher et al., 2013]. |
| Dataset Splits | Yes | Table 4: Performance on the dev set of SST-2. While exact split percentages are not stated, the reference to the SST-2 'dev set' indicates that the standard train/dev split of this benchmark is used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions software such as 'BERT-base', 'GPT2', and 'DistilGPT2', but it does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | A grid search on parameters is performed in our experiment, where the learning rate of the student is selected from {1e-5, 2e-5, 5e-5, 1e-4}, the learning rate of the Gp is selected from {5e-6, 1e-5, 2e-5}, the batch size is selected from {128, 256}, the epoch for training is set to 10, the temperature of KD is selected from {1, 5, 10}, and α is selected from {0.5, 0.9}. Adam is used to optimize the student network and the topic prompter. The top 50 tokens with the highest probability are selected for decoding and the threshold for top-p is set to 0.95. (A hedged code sketch of this setup follows the table.) |
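
For readers reconstructing the setup described in the Pseudocode and Experiment Setup rows, the following is a minimal sketch of a synthesize-then-distill loop in the spirit of Algorithm 1, using the reported settings (Adam, KD temperature from {1, 5, 10}, α from {0.5, 0.9}, top-k = 50, top-p = 0.95). The model checkpoints, prompt strings, 4-class label space, and exact loss mixing are illustrative assumptions, and the reinforced update of the topic prompter, the paper's core contribution, is only marked by a comment; this is not the authors' implementation.

```python
# A hedged sketch of a data-free distillation loop (NOT the authors' code).
# Assumptions: distilgpt2 as the generator, a 4-class BERT teacher/student,
# teacher pseudo-labels as the "hard" target, and a simple alpha-weighted
# soft/hard loss. The reinforced topic-prompter update is omitted.
import torch
import torch.nn.functional as F
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          BertForSequenceClassification, BertTokenizerFast)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Generator for pseudo-text synthesis (the paper mentions GPT2 / DistilGPT2).
gen_tok = GPT2TokenizerFast.from_pretrained("distilgpt2")
gen_tok.pad_token = gen_tok.eos_token
gen_tok.padding_side = "left"   # left-pad so generation continues cleanly
generator = GPT2LMHeadModel.from_pretrained("distilgpt2").to(device).eval()

# Teacher and student; in practice the teacher would be a BERT-base model
# already fine-tuned on the target task (e.g. AG News), not a fresh head.
bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
teacher = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4).to(device).eval()
student = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4).to(device).train()

optimizer = torch.optim.Adam(student.parameters(), lr=2e-5)  # lr from the grid
T, alpha = 5.0, 0.9                          # KD temperature, mixing weight

prompts = ["The news about sports is that",    # hypothetical fixed prompts;
           "The news about business is that"]  # the paper learns these instead


def synthesize(prompt_texts, max_new_tokens=48):
    """Decode pseudo sentences with top-k=50 / top-p=0.95 sampling."""
    enc = gen_tok(prompt_texts, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = generator.generate(**enc, do_sample=True, top_k=50, top_p=0.95,
                                 max_new_tokens=max_new_tokens,
                                 pad_token_id=gen_tok.eos_token_id)
    return gen_tok.batch_decode(out, skip_special_tokens=True)


for step in range(10):  # the reported setup trains for 10 epochs
    texts = synthesize(prompts)
    batch = bert_tok(texts, return_tensors="pt", padding=True,
                     truncation=True, max_length=128).to(device)
    with torch.no_grad():
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits

    # Soft loss: temperature-scaled KL to the teacher; hard loss: cross-entropy
    # on teacher pseudo-labels (no ground-truth labels in data-free KD).
    soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(s_logits, t_logits.argmax(dim=-1))
    loss = alpha * soft + (1 - alpha) * hard

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Algorithm 1 additionally updates the topic prompter here with a
    # reinforcement-learning reward; that step is intentionally omitted.
```

The α-weighted mix of a temperature-scaled soft loss and a pseudo-label hard loss follows the common Hinton-style KD formulation; the paper's exact objective and reward for the topic prompter may differ.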