Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?

Authors: Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Yunqing Zhao, Ngai-Man Cheung

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning across multiple datasets and teacher-student architectures."
Researcher Affiliation | Academia | "Singapore University of Technology and Design (SUTD). Correspondence to: Ngai-Man Cheung <ngaiman_cheung@sutd.edu.sg>."
Pseudocode | Yes | "We include the visualization algorithm and Numpy-style code in Supplementary F." (a hedged sketch follows this table)
Open Source Code | Yes | "Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/"
Open Datasets | Yes | "large-scale KD experiments including image classification using ImageNet-1K (Deng et al., 2009), fine-grained image classification using CUB200-2011 (Wah et al., 2011), neural machine translation (English-German, English-Russian translation) using IWSLT"
Dataset Splits | Yes | "For visualization of penultimate layer representations, we use 150 samples for training set and 50 samples for validation set."
Hardware Specification | No | The paper does not specify particular hardware components such as specific GPU or CPU models used for running the experiments.
Software Dependencies | Yes | "To allow for training in containerised environments (HPC, Super-computing clusters), please use nvcr.io/nvidia/pytorch:20.12-py3 container."
Experiment Setup | Yes | "For training LS networks, we train for 90 epochs with initial learning rate 0.1 decayed by a factor of 10 every 30 epochs. For KD experiments, we train for 200 epochs with initial learning rate 0.1 decayed by a factor of 10 every 80 epochs. We conducted a grid search for hyper-parameters as well. For all experiments, we use a batch size of 256 and SGD with momentum 0.9." (a hedged sketch follows this table)
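The Pseudocode and Dataset Splits rows refer to the paper's penultimate-layer visualizations (Supplementary F), which follow the projection scheme of Müller et al. (2019): choose three classes, span a plane through their final-layer weight templates, and project penultimate activations onto that plane. The NumPy sketch below is our reconstruction under that assumption; the function names and the random stand-in data are ours, not the authors' released code.

```python
import numpy as np

def plane_basis(w1, w2, w3):
    """Orthonormal basis of the plane through three class templates
    (rows of the final fully connected layer's weight matrix)."""
    e1 = (w2 - w1) / np.linalg.norm(w2 - w1)
    v = (w3 - w1) - ((w3 - w1) @ e1) * e1   # Gram-Schmidt step
    e2 = v / np.linalg.norm(v)
    return e1, e2

def project(acts, w1, e1, e2):
    """Project (N, D) penultimate activations to (N, 2) plane coordinates."""
    centered = acts - w1
    return np.stack([centered @ e1, centered @ e2], axis=1)

# Random stand-ins for real templates and activations:
rng = np.random.default_rng(0)
D = 512                                   # penultimate width (e.g., ResNet)
templates = rng.normal(size=(3, D))       # templates of the 3 chosen classes
acts = rng.normal(size=(200, D))          # e.g., 150 train + 50 val samples
e1, e2 = plane_basis(*templates)
coords = project(acts, templates[0], e1, e2)  # scatter-plot these 2-D points
```

In actual use, `templates` would come from the trained classifier's final layer and `acts` from the 150 training and 50 validation samples quoted in the Dataset Splits row.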
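The Experiment Setup row fully specifies the optimization recipe, which maps directly onto standard PyTorch. The sketch below illustrates those reported settings and is not the authors' code; the ResNet-50 architecture, the label-smoothing coefficient of 0.1, and the placeholder data loader are our assumptions.

```python
import torch
import torchvision

# Placeholder data; the paper trains on ImageNet-1K with batch size 256.
data = torch.utils.data.TensorDataset(
    torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,)))
loader = torch.utils.data.DataLoader(data, batch_size=256)

model = torchvision.models.resnet50(num_classes=1000)  # assumed architecture
# label_smoothing requires PyTorch >= 1.10; the quoted 20.12 container ships an
# older version, where smoothing must be implemented manually. alpha = 0.1 is
# our assumption, not a value quoted in the table.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# LS schedule: decay lr by 10x every 30 epochs over 90 epochs.
# KD runs instead use step_size=80 over 200 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Since the row also mentions a grid search, exact values may differ per task; this sketch fixes the image-classification defaults stated in the quote.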