Knowledge Distillation with Auxiliary Variable

Authors: Bo Peng, Zhen Fang, Guangquan Zhang, Jie Lu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments. Baselines. We compare our method with mainstream knowledge distillers, including KD (Hinton et al., 2015), DKD (Yeh et al., 2022), IPWD (Niu et al., 2022), WSLD (Zhou et al., 2021), CS-KD (Yun et al., 2020), TF-KD (Yuan et al., 2020), PS-KD (Kim et al., 2021), NKD (Yang et al., 2023), MLD (Jin et al., 2023), DIST (Huang et al., 2022a), FitNets (Romero et al., 2014), CRD (Tian et al., 2019), WCoRD (Chen et al., 2021a), ReviewKD (Chen et al., 2021b), NORM (Liu et al., 2023), CoCoRD (Fu et al., 2023), DiffKD (Huang et al., 2023), SRRL (Yang et al., 2021) and SSKD (Xu et al., 2020). Settings. We conduct experiments on multiple benchmarks for knowledge transfer: CIFAR-100 (Krizhevsky et al., 2009), ImageNet-1K (Russakovsky et al., 2015), STL-10 (Coates et al., 2011), Tiny-ImageNet (Chrabaszcz et al., 2017), PASCAL-VOC (Everingham et al., 2009) and MSCOCO (Lin et al., 2014).
Researcher Affiliation | Academia | Faculty of Engineering & Information Technology, University of Technology Sydney, Sydney, Australia.
Pseudocode | Yes | Algorithm 1: Knowledge distillation with auxiliary variable
Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the described methodology.
Open Datasets | Yes | Settings. We conduct experiments on multiple benchmarks for knowledge transfer: CIFAR-100 (Krizhevsky et al., 2009), ImageNet-1K (Russakovsky et al., 2015), STL-10 (Coates et al., 2011), Tiny-ImageNet (Chrabaszcz et al., 2017), PASCAL-VOC (Everingham et al., 2009) and MSCOCO (Lin et al., 2014).
Dataset Splits | No | The paper mentions using a 'validation set' for ImageNet-1K, as seen in 'Table 4: Top-1 and Top-5 accuracy (%) on ImageNet-1K validation set', but it does not specify the explicit train/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, memory) used to run its experiments. The 'Implementation Details' section focuses on software parameters and training settings.
Software Dependencies | No | The paper mentions 'PyTorch ImageNet practice' but does not specify exact version numbers for PyTorch or any other software dependencies needed for reproduction.
Experiment Setup | Yes | We set the batch size as 64 and the initial learning rate as 0.01 (for ShuffleNet and MobileNet-V2) or 0.05 (for the other series). We train the model for 240 epochs, in which the learning rate is decayed by 10 every 30 epochs after 150 epochs. We use SGD as the optimizer with weight decay 5e-4 and momentum 0.9.
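
The recipe quoted in the Experiment Setup row maps onto a standard PyTorch SGD plus MultiStepLR configuration. The sketch below is one reading of that description, not code from the paper: `student_model` and `train_one_epoch` are hypothetical placeholders, and the milestone list [150, 180, 210] is our interpretation of "decayed by 10 every 30 epochs after 150 epochs" over 240 epochs.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# student_model is a placeholder; the student architecture determines the base LR.
lightweight_student = True  # ShuffleNet / MobileNet-V2 students use 0.01, other series 0.05
base_lr = 0.01 if lightweight_student else 0.05

optimizer = SGD(
    student_model.parameters(),
    lr=base_lr,
    momentum=0.9,
    weight_decay=5e-4,
)

# Divide the LR by 10 at epochs 150, 180 and 210 (our reading of the quoted schedule).
scheduler = MultiStepLR(optimizer, milestones=[150, 180, 210], gamma=0.1)

for epoch in range(240):
    train_one_epoch(student_model, optimizer)  # placeholder training loop, batch size 64
    scheduler.step()
```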
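For reference, the simplest baseline in the comparison list above, KD (Hinton et al., 2015), distills by matching temperature-softened teacher and student logits. The sketch below illustrates that generic baseline only, not the paper's auxiliary-variable method; the temperature `T` and weight `alpha` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Vanilla KD objective (Hinton et al., 2015): softened-logit KL plus hard-label CE."""
    # KL between temperature-softened distributions, scaled by T^2 as in Hinton et al.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```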