Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching
Authors: Mingi Ji, Byeongho Heo, Sungrae Park
AAAI 2021, pp. 7945-7952
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments for model compression on three image classification tasks such as CIFAR-100 (Krizhevsky, Hinton et al. 2009), tiny ImageNet, and ImageNet (Deng et al. 2009) and for domain transfer on four specific tasks such as CUB200 (Wah et al. 2011), MIT67 (Quattoni and Torralba 2009), Stanford40 (Yao et al. 2011), and Stanford Dogs (Khosla et al. 2011) with a pre-trained large network on ImageNet. |
| Researcher Affiliation | Collaboration | Mingi Ji (1), Byeongho Heo (2), Sungrae Park (3); (1) Korea Advanced Institute of Science and Technology (KAIST), (2) NAVER AI LAB, (3) CLOVA AI Research, NAVER Corp. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The implementation code is open sourced at github.com/clovaai/attention-feature-distillation. |
| Open Datasets | Yes | We conduct an experiment on CIFAR-100 that consists of 32×32 sized color images for 100 object classes and has 50K training and 10K validation images. |
| Dataset Splits | Yes | We conduct an experiment on CIFAR-100 that consists of 32×32 sized color images for 100 object classes and has 50K training and 10K validation images. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python 3.x, PyTorch 1.x) needed to replicate the experiment. |
| Experiment Setup | Yes | We set the batch size as 64 and the maximum iteration as 240 epochs. All models are trained with stochastic gradient descent with 0.9 of momentum, weight decay as 5×10⁻⁴, initial learning rate as 0.05, and divide it by 10 at 150, 180, 210 epochs. (A minimal training-setup sketch follows the table.) |
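
The snippet below is a minimal sketch of the reported CIFAR-100 training setup (batch size 64, 240 epochs, SGD with momentum 0.9, weight decay 5×10⁻⁴, initial learning rate 0.05 divided by 10 at epochs 150, 180, and 210). The student network, augmentation, and normalization statistics are placeholder assumptions, and the distillation loss itself is omitted; the authors' actual implementation is at github.com/clovaai/attention-feature-distillation.

```python
import torch
import torchvision
import torchvision.transforms as T

# Standard CIFAR-100 preprocessing (augmentation and normalization statistics
# are common defaults, not confirmed by the paper).
transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

# CIFAR-100: 50K training images, 100 classes; batch size 64 as reported.
train_set = torchvision.datasets.CIFAR100(
    root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=64, shuffle=True, num_workers=4)

# Placeholder student network; the paper uses various student/teacher pairs.
model = torchvision.models.resnet18(num_classes=100)

# SGD with momentum 0.9, weight decay 5e-4, initial LR 0.05,
# divided by 10 at epochs 150, 180, and 210.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 180, 210], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(240):  # maximum of 240 epochs
    for images, labels in train_loader:
        optimizer.zero_grad()
        # The attention-based feature-distillation loss would be added here.
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```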