Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation
Authors: Zhiwei Hao, Jianyuan Guo, Ding Jia, Kai Han, Yehui Tang, Chao Zhang, Han Hu, Yunhe Wang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method on the ImageNet-1k classification task. The proposed manifold KD outperforms the distillation method in [16] by +2.0% top-1 accuracy on DeiT-Tiny. We also conduct transfer learning experiments on CIFAR-10/100 and evaluate our method on downstream tasks such as object detection and semantic segmentation. |
| Researcher Affiliation | Collaboration | Zhiwei Hao (1,2), Jianyuan Guo (2), Ding Jia (2,3), Kai Han (2), Yehui Tang (2,3), Chao Zhang (3), Han Hu (1), Yunhe Wang (2). 1: School of Information and Electronics, Beijing Institute of Technology. 2: Huawei Noah's Ark Lab. 3: Key Laboratory of Machine Perception (MOE), School of Intelligence Science and Technology, Peking University. |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | PyTorch code: https://github.com/Hao840/manifolddistillation and https://github.com/huawei-noah/Efficient-Computing. |
| Open Datasets | Yes | We evaluate our fine-grained manifold distillation method on the ImageNet-1k [39] classification task, CIFAR-10/100 [40] transfer learning task, COCO [41] object detection task, and ADE20K [42] semantic segmentation task. |
| Dataset Splits | Yes | ImageNet-1k... consists of more than 1.2M training images and 50K validation images from 1000 classes. |
| Hardware Specification | Yes | Each student is trained for 300 epochs with 8 Tesla-V100 GPUs. |
| Software Dependencies | No | Our implementation is based on the PyTorch framework [43] and the MindSpore Lite tool [44]. (No specific version is given for either PyTorch or MindSpore Lite.) |
| Experiment Setup | Yes | The hyper-parameter λ in the KD loss is set to 1, i.e., the real label is not used to train the student. When the teacher is smaller than the student, to prevent the performance degradation caused by the weak teacher, we set λ to 0.5. In the fine-grained manifold distillation loss, hyper-parameters α, β, and γ are set to 4, 0.1, and 0.2, respectively. The sampling number K in the loss term L_random is set to 192. ... Each student is trained for 300 epochs with 8 Tesla-V100 GPUs. |
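
For context, the sketch below shows one way the reported hyper-parameters (λ, α, β, γ, K) could be wired into a combined logit-plus-manifold distillation loss in PyTorch. It assumes the intra-image / inter-image / randomly-sampled decoupling the paper describes; the function names, the exact form of each term, and the temperature `tau` are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

# Weights reported in the table above; the loss-term names are illustrative.
LAMBDA = 1.0                          # KD weight (0.5 when the teacher is weaker than the student)
ALPHA, BETA, GAMMA = 4.0, 0.1, 0.2    # intra-image / inter-image / random-sample manifold weights
K = 192                               # number of randomly sampled patches for the L_random term


def patch_manifold(feats: torch.Tensor) -> torch.Tensor:
    """Relation map of L2-normalized patch embeddings: (B, N, D) -> (B, N, N)."""
    f = F.normalize(feats, dim=-1)
    return f @ f.transpose(-1, -2)


def manifold_kd_loss(student_feats, teacher_feats,
                     student_logits, teacher_logits, tau=1.0):
    # Logit distillation with soft targets only (lambda = 1 drops the real label).
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2

    # Intra-image term: match patch-to-patch relations within each image.
    intra = F.mse_loss(patch_manifold(student_feats), patch_manifold(teacher_feats))

    # Inter-image term: match relations across images at the same patch index.
    inter = F.mse_loss(patch_manifold(student_feats.transpose(0, 1)),
                       patch_manifold(teacher_feats.transpose(0, 1)))

    # Randomly sampled term: match relations among K patches drawn from the whole batch.
    B, N, Ds = student_feats.shape
    Dt = teacher_feats.shape[-1]
    idx = torch.randperm(B * N, device=student_feats.device)[:K]
    s_rand = student_feats.reshape(B * N, Ds)[idx].unsqueeze(0)   # (1, K, Ds)
    t_rand = teacher_feats.reshape(B * N, Dt)[idx].unsqueeze(0)   # (1, K, Dt)
    rand = F.mse_loss(patch_manifold(s_rand), patch_manifold(t_rand))

    return LAMBDA * kd + ALPHA * intra + BETA * inter + GAMMA * rand
```

Because each manifold term compares relation maps (N×N, B×B, or K×K) rather than raw features, the student and teacher embedding widths need not match, which is consistent with the patch-level manifold matching idea the paper builds on.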