Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation

Authors: Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, Yunhe Wang

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed method on the ImageNet-1k classification task. The proposed manifold KD outperforms the distillation method in [16] by +2.0% top-1 accuracy on DeiT-Tiny. We also conduct transfer learning experiments on CIFAR-10/100 and evaluate our method on downstream tasks such as object detection and semantic segmentation.
Researcher Affiliation | Collaboration | Zhiwei Hao1,2, Jianyuan Guo2, Ding Jia2,3, Kai Han2, Yehui Tang2,3, Chao Zhang3, Han Hu1, Yunhe Wang2. 1School of Information and Electronics, Beijing Institute of Technology. 2Huawei Noah's Ark Lab. 3Key Laboratory of Machine Perception (MOE), School of Intelligence Science and Technology, Peking University.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | PyTorch code: https://github.com/Hao840/manifolddistillation and https://github.com/huawei-noah/Efficient-Computing.
Open Datasets | Yes | We evaluate our fine-grained manifold distillation method on ImageNet-1k [39] classification task, CIFAR-10/100 [40] transfer learning task, COCO [41] object detection task, and ADE20K [42] semantic segmentation task.
Dataset Splits | Yes | ImageNet-1k... consists of more than 1.2M training images and 50K validation images from 1000 classes.
Hardware Specification | Yes | Each student is trained for 300 epochs with 8 Tesla-V100 GPUs.
Software Dependencies | No | Our implementation is based on the PyTorch framework [43] and the MindSpore Lite tool [44]. (No specific PyTorch version is mentioned, and 'MindSpore Lite' lacks a precise version such as 2.x.x.)
Experiment Setup | Yes | The hyper-parameter λ in the KD loss is set to 1, i.e., the real label is not used to train the student. When the teacher is smaller than the student, to prevent the performance degradation caused by the weak teacher, we set λ to 0.5. In the fine-grained manifold distillation loss, hyper-parameters α, β, and γ are set to 4, 0.1, and 0.2, respectively. The sampling number K in the loss term Lrandom is set to 192. ... Each student is trained for 300 epochs with 8 Tesla-V100 GPUs.
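
To make the reported setup concrete, the sketch below shows one way the quoted weights (λ = 1, α = 4, β = 0.1, γ = 0.2, K = 192) could be combined into a single PyTorch training loss. It is a minimal illustration only: the function names (relation_map, manifold_kd_loss), the cosine-similarity relation maps, and the MSE matching are assumptions, not the authors' released implementation (see the repositories linked above for that).

```python
import torch
import torch.nn.functional as F

# Weights as quoted in the experiment setup; the loss assembly itself is a
# hypothetical sketch, not the authors' released code.
LAMBDA, ALPHA, BETA, GAMMA, K = 1.0, 4.0, 0.1, 0.2, 192


def relation_map(feats: torch.Tensor) -> torch.Tensor:
    """Pairwise similarity map over L2-normalized token features.

    feats: (groups, tokens, dim). The exact manifold construction in the paper
    may differ; this assumes a cosine-similarity relation map, which makes the
    result independent of the (possibly different) teacher/student feature dim.
    """
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.transpose(-2, -1)  # (groups, tokens, tokens)


def manifold_kd_loss(student_feats, teacher_feats, student_logits, teacher_logits):
    """Combine soft-label KD with intra-image, inter-image, and randomly
    sampled manifold terms, weighted by lambda, alpha, beta, gamma.

    Features are assumed to be patch tokens of shape (batch, patches, dim),
    with matching batch and patch counts for teacher and student.
    """
    # Soft-label KD on logits; the hard-label cross-entropy term is omitted
    # because lambda = 1 in the quoted setup (real labels unused).
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")

    # Intra-image term: match per-image patch-to-patch relation maps.
    intra = F.mse_loss(relation_map(student_feats), relation_map(teacher_feats))

    # Inter-image term: match image-to-image relation maps at each patch index.
    inter = F.mse_loss(relation_map(student_feats.transpose(0, 1)),
                       relation_map(teacher_feats.transpose(0, 1)))

    # Random term: relation map over K patch embeddings sampled across the batch.
    flat_s, flat_t = student_feats.flatten(0, 1), teacher_feats.flatten(0, 1)
    idx = torch.randperm(flat_s.size(0))[:K]
    rand = F.mse_loss(relation_map(flat_s[idx].unsqueeze(0)),
                      relation_map(flat_t[idx].unsqueeze(0)))

    return LAMBDA * kd + ALPHA * intra + BETA * inter + GAMMA * rand
```

Because each relation map compares tokens with each other rather than with a fixed target dimension, this style of loss sidesteps the width mismatch between teacher and student features, which is consistent with how the paper motivates manifold-space matching.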