Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation
Authors: Martin Zong, Zengyu Qiu, Xinzhu Ma, Kunlin Yang, Chunya Liu, Jun Hou, Shuai Yi, Wanli Ouyang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e. CIFAR-100 and ImageNet) and an object detection benchmark (i.e. MS COCO). The results demonstrate the superiority of our method in performance under varying settings. We conduct extensive experiments on image classification and object detection. Moreover, we present various ablations and analysis for the proposed method. |
| Researcher Affiliation | Collaboration | Martin Zong 1, Zengyu Qiu 1, Xinzhu Ma 2,3, Kunlin Yang 1, Chunya Liu 1, Jun Hou 1, Shuai Yi 1, and Wanli Ouyang 2,3; 1 SenseTime Research, 2 Shanghai AI Lab, 3 The University of Sydney; {zongdaoming,qiuzengyu,yangkunlin,liuchunya,houjun,yishuai}@sensetime.com, {xinzhu.ma, wanli.ouyang}@sydney.edu.au |
| Pseudocode | No | The paper describes the methodology using prose and diagrams but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | Our code will be released at https://github.com/Cuibaby/DPK. Besides, our codes as well as training recipes will be publicly available for reproducibility. |
| Open Datasets | Yes | CIFAR-100 (Krizhevsky et al., 2009) contains 50K images for training and 10K images for testing, labeled into 100 fine-grained categories. ImageNet (Deng et al., 2009) consists of 1.2M images for training and 50K images for validation, covering 1,000 categories. MS COCO (Lin et al., 2014) is the most commonly used object detection benchmark, which contains 80 categories. In particular, we use train2017 (118K images) for training, and test on val2017 (5K images). |
| Dataset Splits | Yes | CIFAR-100 (Krizhevsky et al., 2009) contains 50K images for training and 10K images for testing. ImageNet (Deng et al., 2009) consists of 1.2M images for training and 50K images for validation, covering 1,000 categories. MS COCO (Lin et al., 2014) is the most commonly used object detection benchmark, which contains 80 categories. In particular, we use train2017 (118K images) for training, and test on val2017 (5K images). |
| Hardware Specification | Yes | All experiments are conducted on 8 Tesla V100 GPUs |
| Software Dependencies | No | Our implementation on MS-COCO for object detection follows the same setting used in (Yang et al., 2022a). We adopt mean squared error (MSE) as the feature distillation loss Dfeat. All experiments are conducted on 8 Tesla V100 GPUs, and our implementation is based on mmdetection framework (Chen et al., 2019). No specific version numbers for software are provided. |
| Experiment Setup | Yes | Training details. On CIFAR-100, the batch size and initial learning rate are set to 64 and 0.05. We train the models for 240 epochs in total with SGD optimizer, and decay the learning rate by 0.1 at 150, 180, and 210 epochs. The weight decay and the momentum are set to 5e-4 and 0.9. On ImageNet, we adopt the SGD optimizer (with 0.9 momentum) to train the student networks for 100 epochs with a batch size of 256. The learning rate is set to 0.1, and we decay it by 0.5 every 25 epochs. We set the weight decay to 0.0001. We also apply the vanilla logits distillation loss (Hinton et al., 2015) in our method. For the loss weights in Eqn. 3, we set α = 0.8 and β = 0.2 for all experiments. The temperature τ used on the ImageNet dataset is set to 1.0, and the same parameter on the CIFAR-100 dataset is set to 4.0. The loss weight of each stage is set to 1.0 in the multi-stage feature distillation setting. |
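The reported setup (step-decayed SGD schedule, and a weighted sum of cross-entropy and the vanilla logits distillation loss of Hinton et al., 2015) can be sketched in pure Python. This is a minimal illustration, not the authors' released code: the paper does not reproduce Eqn. 3 here, so the assignment of α to the cross-entropy term and β to the distillation term is an assumption, and `softmax`, `kd_loss`, and `lr_at` are hypothetical helper names.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(x / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, target, alpha=0.8, beta=0.2, tau=4.0):
    """Assumed form of Eqn. 3: alpha * CE(hard label) + beta * KD (Hinton et al., 2015).

    Defaults match the paper's CIFAR-100 setting (alpha=0.8, beta=0.2, tau=4.0);
    on ImageNet the paper uses tau=1.0.
    """
    # Cross-entropy on the hard label.
    ce = -math.log(softmax(student_logits)[target])
    # KL(teacher || student) on tau-softened distributions, scaled by tau^2
    # as in the original KD formulation.
    p_t = softmax(teacher_logits, tau)
    q_s = softmax(student_logits, tau)
    kl = sum(pt * math.log(pt / qs) for pt, qs in zip(p_t, q_s)) * tau * tau
    return alpha * ce + beta * kl

def lr_at(epoch, base_lr=0.05, milestones=(150, 180, 210), gamma=0.1):
    """CIFAR-100 schedule: lr 0.05, decayed by 0.1 at epochs 150/180/210.

    The ImageNet schedule would use base_lr=0.1, gamma=0.5, and
    milestones at every 25 epochs.
    """
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

With identical student and teacher logits the KL term vanishes, so the loss reduces to `alpha` times the cross-entropy, which is a quick sanity check on the weighting.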