One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation
Authors: Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang, Chang Xu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with various architectures, including CNN, Transformer, and MLP, demonstrate the superiority of our OFA-KD framework in enabling distillation between heterogeneous architectures. Specifically, when equipped with our OFA-KD, the student models achieve notable performance improvements, with a maximum gain of 8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. |
| Researcher Affiliation | Collaboration | Zhiwei Hao (1,2), Jianyuan Guo (3), Kai Han (2), Yehui Tang (2), Han Hu (1), Yunhe Wang (2), Chang Xu (3). 1: School of Information and Electronics, Beijing Institute of Technology. 2: Huawei Noah's Ark Lab. 3: School of Computer Science, Faculty of Engineering, The University of Sydney. |
| Pseudocode | No | The paper provides mathematical formulations and diagrams, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | PyTorch code and checkpoints can be found at https://github.com/Hao840/OFAKD. |
| Open Datasets | Yes | We adopt the CIFAR-100 dataset [52] and the ImageNet-1K dataset [53] for evaluation. |
| Dataset Splits | Yes | The ImageNet-1K dataset is more extensive, containing 1.2 million training samples and 50,000 validation samples, all with a resolution of 224×224. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU models, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch code' but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | Specifically, all CNN students are trained using the SGD optimizer, while those with a ViT or MLP architecture are trained using the AdamW optimizer. For the CIFAR-100 dataset, all models are trained for 300 epochs. When working with the ImageNet-1K dataset, CNNs are trained for 100 epochs, whereas ViTs and MLPs are trained for 300 epochs. (A configuration sketch based on this description follows the table.) |
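
The following is a minimal sketch of the optimizer and epoch selection described in the Experiment Setup row. It is not the authors' released code; the learning rates, momentum, and weight-decay values are illustrative assumptions, since the table does not report them.

```python
# Hedged sketch: pick the optimizer by student architecture family, as the paper
# describes (SGD for CNN students, AdamW for ViT/MLP students).
# Hyperparameter values below are assumptions, not reported in this table.
import torch


def build_optimizer(student: torch.nn.Module, arch_family: str) -> torch.optim.Optimizer:
    """Return an optimizer matching the student's architecture family."""
    if arch_family == "cnn":
        # SGD with momentum is the conventional choice for CNN students.
        return torch.optim.SGD(student.parameters(), lr=0.1,
                               momentum=0.9, weight_decay=5e-4)
    if arch_family in ("vit", "mlp"):
        # AdamW for Transformer- and MLP-style students.
        return torch.optim.AdamW(student.parameters(), lr=1e-3, weight_decay=0.05)
    raise ValueError(f"Unknown architecture family: {arch_family}")


# Epoch budgets reported in the paper: 300 epochs for all models on CIFAR-100;
# on ImageNet-1K, 100 epochs for CNNs and 300 epochs for ViTs and MLPs.
EPOCHS = {
    ("cifar100", "cnn"): 300,
    ("cifar100", "vit"): 300,
    ("cifar100", "mlp"): 300,
    ("imagenet1k", "cnn"): 100,
    ("imagenet1k", "vit"): 300,
    ("imagenet1k", "mlp"): 300,
}
```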