Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation

Authors: Kenneth Borup, Lars N. Andersen

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, it has been shown that this procedure often generalizes better than the model trained merely on the original targets, and achieves higher predictive performance on validation data, despite no additional information being provided during training (Furlanello et al., 2018; Ahn et al., 2019; Yang et al., 2018). Experimental results in Section B can be found at github.com/Kennethborup/self_distillation.
Researcher Affiliation | Academia | Kenneth Borup, Department of Mathematics, Aarhus University, kennethborup@math.au.dk; Lars N. Andersen, Department of Mathematics, Aarhus University, larsa@math.au.dk
Pseudocode | Yes | Algorithm 1: Calculate $\hat\beta^{(\tau)}$ and $\alpha^{*(\tau)}$ for $\tau \geq 2$. Calculate $\hat\beta^{(1)}$ from (3) (with any $\alpha^{(1)}$); calculate $y^{(1)} = f(X, \hat\beta^{(1)})$; for $t = 2$ to $\tau$ do: calculate $\hat\beta^{(t)}_{\alpha=0}$ from (3) and $y^{(t)}_{\alpha=0} = f(X, \hat\beta^{(t)}_{\alpha=0})$; solve $\alpha^{*(t)} = \operatorname{argmin}_{\alpha \in \mathbb{R}} \lVert y - \bigl(\alpha y^{(1)} + (1-\alpha)\, y^{(t)}_{\alpha=0}\bigr) \rVert^2$; calculate $\hat\beta^{(t)}$ from (3) with $\alpha^{*(t)}$; end. (A code sketch of this procedure appears below the table.)
Open Source Code | Yes | Experimental results in Section B can be found at github.com/Kennethborup/self_distillation.
Open Datasets | Yes | We perform self-distillation with ResNet-50 (He et al., 2016) networks on CIFAR-10 (Krizhevsky and Hinton, 2009).
Dataset Splits | No | The paper mentions using validation data for comparison, but it does not provide specific details on the dataset split percentages, sample counts, or the methodology used to create these splits.
Hardware Specification | No | The paper states: "We would like to thank GenomeDK and Aarhus University for providing computational resources that contributed to these research results." This statement is too general and does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts).
Software Dependencies | No | The paper mentions "PyTorch Lightning" in a citation but does not specify a version number. No other specific software with version numbers is provided.
Experiment Setup | Yes | The model is initialized randomly at each step and trained according to the above with either estimated optimal parameters, $\hat\alpha^{(\tau)}$, or fixed $\alpha$ for all steps. We use the network weights from the last iteration of training at each distillation step for the next step, irrespective of whether a better model occurred earlier in the training. Our models are trained for a fixed 75 epochs and each experiment is repeated with 4 different random seeds over 11 chains of distillation steps. (A training-chain sketch appears below the table.)
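The pseudocode above (Algorithm 1) alternates a model fit from the paper's equation (3) with a one-dimensional least-squares estimate of the mixing weight alpha*(t) between the first-step predictions and the current alpha=0 predictions. Below is a minimal NumPy sketch of that loop, assuming a kernel ridge regression setting: fit_step is a stand-in for equation (3) (whose exact closed form is not reproduced here), K is a precomputed kernel matrix, and fitting the alpha-weighted mixture of the ground-truth targets y and the previous step's predictions is an assumption about the form of (3). The function names are hypothetical; only the alpha*(t) update is the exact closed-form solution of the quoted least-squares problem.

import numpy as np

def fit_step(K, targets, lam):
    # Stand-in for equation (3): a kernel ridge regression solve (K + lam*I) beta = targets.
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), targets)

def predict(K, beta):
    # f(X, beta) for a kernel model: predictions on the training inputs.
    return K @ beta

def optimal_alpha(y, y1, yt0):
    # Closed-form minimizer of ||y - (alpha*y1 + (1 - alpha)*yt0)||^2 over alpha in R.
    d = y1 - yt0
    denom = float(d @ d)
    return float((y - yt0) @ d) / denom if denom > 0 else 0.0

def self_distill(K, y, tau, lam=1e-2):
    # Sketch of Algorithm 1: run tau distillation steps, estimating alpha*(t) at each step t >= 2.
    beta = fit_step(K, y, lam)                 # beta^(1); alpha^(1) plays no role at step 1
    y1 = predict(K, beta)                      # y^(1) = f(X, beta^(1))
    y_prev, alphas = y1, []
    for t in range(2, tau + 1):
        beta_a0 = fit_step(K, y_prev, lam)     # beta^(t)_{alpha=0}: fit previous predictions only
        yt0 = predict(K, beta_a0)              # y^(t)_{alpha=0}
        a = optimal_alpha(y, y1, yt0)          # alpha*(t)
        beta = fit_step(K, a * y + (1 - a) * y_prev, lam)   # assumed alpha-weighted form of (3)
        y_prev = predict(K, beta)
        alphas.append(a)
    return beta, alphas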
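For the experiment setup, the sketch below illustrates one self-distillation chain on CIFAR-10 with randomly re-initialized ResNet-50 students, each trained for 75 epochs and then used, with its last-iteration weights, as the teacher for the next step. It assumes a plain PyTorch/torchvision pipeline and an alpha-weighted sum of hard-label and teacher soft-target cross-entropy losses (the soft-target form of F.cross_entropy requires PyTorch 1.10+); the optimizer, loss weighting, augmentation, and other hyperparameters are illustrative assumptions, not the paper's exact configuration, which is available in the authors' repository.

import copy
import torch
import torch.nn.functional as F
import torchvision
from torchvision import transforms

def train_step(model, loader, teacher, alpha, epochs, device):
    # Train one distillation step; with no teacher this is plain supervised training.
    model.to(device).train()
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            if teacher is None:
                loss = F.cross_entropy(logits, y)
            else:
                with torch.no_grad():
                    soft = F.softmax(teacher(x), dim=1)   # previous step's predictions
                # Assumed alpha-weighting of ground-truth and teacher targets.
                loss = alpha * F.cross_entropy(logits, y) + (1 - alpha) * F.cross_entropy(logits, soft)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

def distillation_chain(steps=11, alpha=0.5, epochs=75):
    # The number of steps and the fixed alpha are illustrative choices.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    data = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                        transform=transforms.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True, num_workers=4)
    teacher = None
    for _ in range(steps):
        student = torchvision.models.resnet50(num_classes=10)   # random initialization at every step
        student = train_step(student, loader, teacher, alpha, epochs, device)
        teacher = copy.deepcopy(student).eval()                 # last-iteration weights feed the next step
    return teacher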