Understanding the Role of the Projector in Knowledge Distillation
Authors: Roy Miles, Krystian Mikolajczyk
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data-efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. |
| Researcher Affiliation | Academia | Roy Miles, Krystian Mikolajczyk; Imperial College London; r.miles18@imperial.ac.uk, k.mikolajczyk@imperial.ac.uk |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models are publicly available. |
| Open Datasets | Yes | Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017)... |
| Dataset Splits | No | Experiments on the CIFAR-100 classification task (Krizhevsky 2009) consist of 60K 32×32 RGB images across 100 classes with a 5:1 training/testing split (i.e., 50K training and 10K test images; no explicit validation split). The ImageNet (Russakovsky et al. 2014) classification uses 1.3 million images (no explicit split). An ImageNet-1K 20% subset is also used (still not a full train/val/test split). |
| Hardware Specification | Yes | All experiments were performed on a single NVIDIA RTX A5000. |
| Software Dependencies | No | The paper mentions using 'torchdistill', the 'CRD code', and the 'co-advice code', but does not specify version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | We follow the same training schedule as CRD (Tian, Krishnan, and Isola 2019) for both the CIFAR100 and ImageNet experiments. For object detection, we use the same training schedule as ReviewKD (Chen et al. 2021a), while for the data-efficient training we use the same as Co-Advice (Ren et al. 2022). All experiments were performed on a single NVIDIA RTX A5000. When using batch normalisation for the representations, we removed the affine parameters and set ϵ = 0.0001. For all experiments we jointly train the student using a task loss Ltask and the feature distillation loss given in equation 11. ... We use the exact same training methodology as Co-Advice (Ren et al. 2022) and choose to use batch normalisation, a linear projection layer, and α = 4 as the parameters for distillation. ... The input size is set to 224 × 224, and we employed a typical augmentation procedure that includes cropping and horizontal flipping. We used the torchdistill library with the standard configuration, which involves 100 training epochs using SGD and an initial learning rate of 0.1, which is decreased by a factor of 10 at epochs 30, 60, and 90. |
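
As an illustration of the quoted setup, below is a minimal PyTorch sketch of a linear projector followed by batch normalisation with the affine parameters removed (ε = 1e-4), trained jointly with a task loss, plus the quoted SGD schedule (100 epochs, lr 0.1 decayed by 10× at epochs 30, 60, 90). The names `DistillationHead` and `build_optimizer` are hypothetical; the L1 distance between normalised features is a placeholder, since the paper's Equation 11 is not reproduced in this table; using α = 4 as a loss weight, as well as the momentum and weight-decay values, are assumptions rather than quoted settings.

```python
# Sketch of the quoted distillation setup; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistillationHead(nn.Module):
    """Linear projector + batch norm without affine parameters (eps = 1e-4)."""

    def __init__(self, student_dim: int, teacher_dim: int, eps: float = 1e-4):
        super().__init__()
        # Linear projection from student to teacher feature dimension.
        self.proj = nn.Linear(student_dim, teacher_dim)
        # Batch normalisation with affine parameters removed, as described.
        self.bn_s = nn.BatchNorm1d(teacher_dim, eps=eps, affine=False)
        self.bn_t = nn.BatchNorm1d(teacher_dim, eps=eps, affine=False)

    def forward(self, z_student: torch.Tensor, z_teacher: torch.Tensor) -> torch.Tensor:
        # Teacher features are treated as fixed targets.
        z_teacher = z_teacher.detach()
        # Placeholder feature-distillation distance (NOT the paper's Eq. 11).
        return F.l1_loss(self.bn_s(self.proj(z_student)), self.bn_t(z_teacher))


def total_loss(logits, targets, z_student, z_teacher, head, alpha=4.0):
    # Joint objective: task loss plus weighted feature-distillation loss
    # (treating alpha = 4 as a loss weight is an assumption).
    return F.cross_entropy(logits, targets) + alpha * head(z_student, z_teacher)


def build_optimizer(params):
    # Schedule matching the quoted torchdistill configuration: 100 epochs of SGD,
    # lr 0.1 decayed by 10x at epochs 30, 60, 90. Momentum and weight decay are
    # typical ImageNet values, not quoted in the table.
    optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.1
    )
    return optimizer, scheduler
```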