One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation
Authors: Zhiwei Hao, Jianyuan Guo, Kai Han, Yehui Tang, Han Hu, Yunhe Wang, Chang Xu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with various architectures, including CNN, Transformer, and MLP, demonstrate the superiority of our OFA-KD framework in enabling distillation between heterogeneous architectures. Specifically, when equipped with our OFA-KD, the student models achieve notable performance improvements, with a maximum gain of 8.0% on the CIFAR-100 dataset and 0.7% on the ImageNet-1K dataset. |
| Researcher Affiliation | Collaboration | Zhiwei Hao (1,2), Jianyuan Guo (3), Kai Han (2), Yehui Tang (2), Han Hu (1), Yunhe Wang (2), Chang Xu (3). 1: School of Information and Electronics, Beijing Institute of Technology. 2: Huawei Noah's Ark Lab. 3: School of Computer Science, Faculty of Engineering, The University of Sydney. |
| Pseudocode | No | The paper provides mathematical formulations and diagrams, but no structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | PyTorch code and checkpoints can be found at https://github.com/Hao840/OFAKD. |
| Open Datasets | Yes | We adopt the CIFAR-100 dataset [52] and the ImageNet-1K dataset [53] for evaluation. |
| Dataset Splits | Yes | The ImageNet-1K dataset is more extensive, containing 1.2 million training samples and 50,000 validation samples, all with a resolution of 224×224. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU models, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch code' but does not specify its version or any other software dependencies with version numbers. |
| Experiment Setup | Yes | Specifically, all CNN students are trained using the SGD optimizer, while those with a ViT or MLP architecture are trained using the AdamW optimizer. For the CIFAR-100 dataset, all models are trained for 300 epochs. When working with the ImageNet-1K dataset, CNNs are trained for 100 epochs, whereas ViTs and MLPs are trained for 300 epochs. (A configuration sketch based on this description follows the table.) |
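
The following is a minimal sketch of the optimizer and epoch selection described in the Experiment Setup row. It is not the authors' released code; the learning rates, momentum, and weight-decay values are illustrative assumptions, since the table does not report them.

```python
# Hedged sketch: pick the optimizer by student architecture family, as the paper
# describes (SGD for CNN students, AdamW for ViT/MLP students).
# Hyperparameter values below are assumptions, not reported in this table.
import torch


def build_optimizer(student: torch.nn.Module, arch_family: str) -> torch.optim.Optimizer:
    """Return an optimizer matching the student's architecture family."""
    if arch_family == "cnn":
        # SGD with momentum is the conventional choice for CNN students.
        return torch.optim.SGD(student.parameters(), lr=0.1,
                               momentum=0.9, weight_decay=5e-4)
    if arch_family in ("vit", "mlp"):
        # AdamW for Transformer- and MLP-style students.
        return torch.optim.AdamW(student.parameters(), lr=1e-3, weight_decay=0.05)
    raise ValueError(f"Unknown architecture family: {arch_family}")


# Epoch budgets reported in the paper: 300 epochs for all models on CIFAR-100;
# on ImageNet-1K, 100 epochs for CNNs and 300 epochs for ViTs and MLPs.
EPOCHS = {
    ("cifar100", "cnn"): 300,
    ("cifar100", "vit"): 300,
    ("cifar100", "mlp"): 300,
    ("imagenet1k", "cnn"): 100,
    ("imagenet1k", "vit"): 300,
    ("imagenet1k", "mlp"): 300,
}
```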