One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation

Authors: Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang, Chang Xu

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments with various architectures, including CNN, Transformer, and MLP, demonstrate the superiority of our OFA-KD framework in enabling distillation between heterogeneous architectures. Specifically, when equipped with our OFA-KD, the student models achieve notable performance improvements, with a maximum gain of 8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. |
| Researcher Affiliation | Collaboration | Zhiwei Hao (1,2), Jianyuan Guo (3), Kai Han (2), Yehui Tang (2), Han Hu (1), Yunhe Wang (2), Chang Xu (3). 1: School of Information and Electronics, Beijing Institute of Technology; 2: Huawei Noah's Ark Lab; 3: School of Computer Science, Faculty of Engineering, The University of Sydney. |
| Pseudocode | No | The paper provides mathematical formulations and diagrams, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | PyTorch code and checkpoints can be found at https://github.com/Hao840/OFAKD. |
| Open Datasets | Yes | We adopt the CIFAR-100 dataset [52] and the ImageNet-1K dataset [53] for evaluation. |
| Dataset Splits | Yes | The ImageNet-1K dataset is more extensive, containing 1.2 million training samples and 50,000 validation samples, all with a resolution of 224×224. (A sketch of loading these splits follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions "PyTorch code" but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | Specifically, all CNN students are trained using the SGD optimizer, while those with a ViT or MLP architecture are trained using the AdamW optimizer. For the CIFAR-100 dataset, all models are trained for 300 epochs. When working with the ImageNet-1K dataset, CNNs are trained for 100 epochs, whereas ViTs and MLPs are trained for 300 epochs. (See the training-configuration sketch after this table.) |
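The Dataset Splits row quotes the split sizes used for evaluation. Below is a minimal sketch of how those datasets could be loaded with torchvision; it is not taken from the authors' repository, and the data-root paths and the 224×224 preprocessing pipeline are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): loading the evaluation datasets
# described above with torchvision. Paths and preprocessing choices are
# assumptions for illustration only.
from torchvision import datasets, transforms

# CIFAR-100: 50,000 training / 10,000 test images at 32x32.
cifar_train = datasets.CIFAR100(root="./data", train=True, download=True,
                                transform=transforms.ToTensor())
cifar_test = datasets.CIFAR100(root="./data", train=False, download=True,
                               transform=transforms.ToTensor())

# ImageNet-1K: ~1.2M training / 50,000 validation images, used at 224x224.
imagenet_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
imagenet_train = datasets.ImageFolder("/path/to/imagenet/train", transform=imagenet_tf)
imagenet_val = datasets.ImageFolder("/path/to/imagenet/val", transform=imagenet_tf)
```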
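The Experiment Setup row describes the optimizer and epoch schedule. The sketch below restates that configuration as plain PyTorch; the learning rates, momentum, and weight-decay values are placeholder assumptions, since the quoted text does not report them.

```python
# Minimal sketch (not the authors' training code) of the quoted setup:
# SGD for CNN students, AdamW for ViT/MLP students, with the epoch counts
# reported per dataset. Hyperparameter values marked "assumed" are not
# from the paper excerpt.
import torch

def build_optimizer(student, student_arch: str):
    """Pick the optimizer family described in the Experiment Setup row."""
    if student_arch == "cnn":
        return torch.optim.SGD(student.parameters(), lr=0.1,      # lr assumed
                               momentum=0.9, weight_decay=5e-4)   # assumed
    if student_arch in ("vit", "mlp"):
        return torch.optim.AdamW(student.parameters(), lr=1e-3,   # lr assumed
                                 weight_decay=0.05)               # assumed
    raise ValueError(f"unknown student architecture: {student_arch}")

def num_epochs(dataset: str, student_arch: str) -> int:
    """Epoch counts quoted in the Experiment Setup row."""
    if dataset == "cifar100":
        return 300                                   # all students: 300 epochs
    if dataset == "imagenet1k":
        return 100 if student_arch == "cnn" else 300  # CNN: 100, ViT/MLP: 300
    raise ValueError(f"unknown dataset: {dataset}")

# Example: a CNN student trained on ImageNet-1K.
# opt = build_optimizer(student, "cnn")
# epochs = num_epochs("imagenet1k", "cnn")   # -> 100
```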