Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?
Authors: Keshigeyan Chandrasegaran, Ngoc-Trung Tran, Yunqing Zhao, Ngai-Man Cheung
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning across multiple datasets and teacher-student architectures. |
| Researcher Affiliation | Academia | Singapore University of Technology and Design (SUTD). Correspondence to: Ngai-Man Cheung <ngaiman_cheung@sutd.edu.sg>. |
| Pseudocode | Yes | We include the visualization algorithm and Numpy-style code in Supplementary F. |
| Open Source Code | Yes | Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/ |
| Open Datasets | Yes | large-scale KD experiments including image classification using ImageNet-1K (Deng et al., 2009), fine-grained image classification using CUB200-2011 (Wah et al., 2011), neural machine translation (English-German, English-Russian translation) using IWSLT |
| Dataset Splits | Yes | For visualization of penultimate layer representations, we use 150 samples for training set and 50 samples for validation set. |
| Hardware Specification | No | The paper does not specify particular hardware components such as specific GPU or CPU models used for running the experiments. |
| Software Dependencies | Yes | To allow for training in containerised environments (HPC, Super-computing clusters), please use nvcr.io/nvidia/pytorch:20.12-py3 container. |
| Experiment Setup | Yes | For training LS networks, we train for 90 epochs with initial learning rate 0.1 decayed by a factor of 10 every 30 epochs. For KD experiments, we train for 200 epochs with initial learning rate 0.1 decayed by a factor of 10 every 80 epochs. We conducted a grid search for hyper-parameters as well. For all experiments, we use a batch size of 256 and SGD with momentum 0.9. (See the configuration sketch after this table.) |
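
The experiment-setup row above maps onto a standard SGD + step-decay schedule. Below is a minimal sketch, assuming a PyTorch-style training loop: the placeholder model, random data, and plain cross-entropy loss are illustrative assumptions, and only the quoted hyper-parameters (batch size 256, SGD with momentum 0.9, initial learning rate 0.1 decayed by a factor of 10) come from the paper.

```python
# Minimal sketch of the quoted training schedules (assumed PyTorch-style setup;
# the tiny model, random data, and plain cross-entropy loss are placeholders).
import torch

model = torch.nn.Linear(512, 1000)       # placeholder network
criterion = torch.nn.CrossEntropyLoss()  # plain CE; the LS / KD losses are not shown here

# Both settings: SGD, momentum 0.9, initial learning rate 0.1, batch size 256.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# LS networks: 90 epochs, LR decayed by 10x every 30 epochs.
# KD experiments: 200 epochs, LR decayed by 10x every 80 epochs.
EPOCHS, STEP = 90, 30                    # switch to 200, 80 for the KD schedule
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=STEP, gamma=0.1)

for epoch in range(EPOCHS):
    inputs = torch.randn(256, 512)                 # stand-in batch (batch size 256)
    targets = torch.randint(0, 1000, (256,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()                               # step decay applied per epoch
```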