UniADS: Universal Architecture-Distiller Search for Distillation Gap

Authors: Liming Lu, Zhenghan Chen, Xiaoyu Lu, Yihang Rao, Lujun Li, Shuchao Pang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are performed on different teacher-student pairs using CIFAR-100 and ImageNet datasets. The experimental results consistently demonstrate the superiority of our method over existing approaches.
Researcher Affiliation | Academia | Liming Lu1*, Zhenghan Chen2*, Xiaoyu Lu1, Yihang Rao1, Lujun Li3, Shuchao Pang1,4; 1School of Cyber Science and Engineering, Nanjing University of Science and Technology; 2Peking University; 3HKUST; 4School of Computing, Macquarie University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'Full implementation details are available in the supplementary materials,' but it does not explicitly state that source code for their methodology is released or provide a link to a code repository.
Open Datasets | Yes | Extensive experiments are performed on different teacher-student pairs using CIFAR-100 and ImageNet datasets. CIFAR-100 (Krizhevsky and Hinton 2009), containing 50,000 training images and 10,000 test images with 100 classes, is the most popular classification dataset for evaluating the performance of knowledge distillation methods. We also conduct experiments on the ImageNet dataset (ILSVRC12) (Deng et al. 2009), which is known as one of the most challenging image classification datasets.
Dataset Splits | Yes | CIFAR-100 (Krizhevsky and Hinton 2009), containing 50,000 training images and 10,000 test images with 100 classes... We also conduct experiments on the ImageNet dataset (ILSVRC12) (Deng et al. 2009), which is known as one of the most challenging image classification datasets. It contains about 1.2 million training images and 50 thousand validation images, each belonging to one of the 1,000 categories.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU/CPU models or memory specifications.
Software Dependencies | No | The paper mentions 'All experiments are conducted with PyTorch (Paszke et al. 2019)' but does not specify a version number or other software dependencies with their versions.
Experiment Setup | Yes | All teacher-student networks are trained with the typical training setting of 200 epochs, following the original papers. During the distiller search phase... including 24 early-stop training epochs. Our UniADS search performs 200 iterations for each teacher-student pair. During the distillation phase, all teacher-student networks are trained using typical training settings, with a training epoch of 240. We set the batch size to 128 and use a standard SGD optimizer. The learning rate is initialized to 0.1 and decays by 0.1 at 100 and 150 epochs. All teacher-student networks are trained with an SGD optimizer for 100 training epochs. The batch size is set to 256, and the learning rate is initialized to 0.1 and decays by 0.1 at 30, 60, and 90 epochs.
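For orientation, the CIFAR-100 distillation-phase settings quoted in the Experiment Setup row (240 epochs, batch size 128, SGD, learning rate 0.1 decayed by 0.1 at epochs 100 and 150) could be instantiated in PyTorch roughly as sketched below. This is a minimal illustration built only from the quoted numbers: the student/teacher architectures, the searched distillation objective, and settings such as momentum and weight decay are not given in the excerpt, so the `student` model, the plain cross-entropy stand-in loss, and those optimizer values are assumptions, not the authors' code.

```python
# Minimal sketch of the quoted CIFAR-100 distillation-phase training settings.
# Placeholders/assumptions: the student network, the loss (the paper's searched
# distiller is not shown here), and momentum/weight-decay values.
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([T.RandomCrop(32, padding=4),
                       T.RandomHorizontalFlip(),
                       T.ToTensor()])
# CIFAR-100: 50,000 training images, 100 classes (as quoted in the table above)
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

student = torchvision.models.resnet18(num_classes=100)  # placeholder student network
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)  # momentum/WD assumed, not stated
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[100, 150], gamma=0.1)

for epoch in range(240):  # "training epoch of 240" per the quoted setup
    for images, labels in train_loader:
        optimizer.zero_grad()
        # Stand-in objective: the actual searched distillation loss is not reproduced here.
        loss = torch.nn.functional.cross_entropy(student(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

The ImageNet settings quoted in the same row (100 epochs, batch size 256, learning rate 0.1 decayed by 0.1 at epochs 30, 60, and 90) would follow the same pattern with `milestones=[30, 60, 90]` and an ImageNet data loader in place of CIFAR-100.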