Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation
Authors: Kenneth Borup, Lars N. Andersen
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, it has been shown that this procedure often generalizes better than the model trained merely on the original targets, and achieves higher predictive performance on validation data, despite no additional information being provided during training (Furlanello et al., 2018; Ahn et al., 2019; Yang et al., 2018). Experimental results in Section B can be found at github.com/Kennethborup/self_distillation. |
| Researcher Affiliation | Academia | Kenneth Borup Department of Mathematics Aarhus University kennethborup@math.au.dk Lars N. Andersen Department of Mathematics Aarhus University larsa@math.au.dk |
| Pseudocode | Yes | Algorithm 1: Calculate $\hat\beta^{(\tau)}$ and $\alpha^{*(\tau)}$ for $\tau \geq 2$. Calculate $\hat\beta^{(1)}$ from (3) (with any $\alpha^{(1)}$); calculate $\hat{y}^{(1)} = f(X, \hat\beta^{(1)})$; **for** $t = 2$ **to** $\tau$ **do**: calculate $\hat\beta^{(t)}_{\alpha=0}$ from (3) and $\hat{y}^{(t)}_{\alpha=0} = f(X, \hat\beta^{(t)}_{\alpha=0})$; solve $\alpha^{*(t)} = \arg\min_{\alpha \in \mathbb{R}} \lVert y - \big(\alpha \hat{y}^{(1)} + (1-\alpha)\hat{y}^{(t)}_{\alpha=0}\big) \rVert^2$; calculate $\hat\beta^{(t)}$ from (3) with $\alpha^{*(t)}$; **end**. (A hedged code sketch of this algorithm appears after the table.) |
| Open Source Code | Yes | Experimental results in Section B can be found at github.com/Kennethborup/self_distillation. |
| Open Datasets | Yes | We perform self-distillation with ResNet-50 (He et al., 2016) networks on CIFAR-10 (Krizhevsky and Hinton, 2009). |
| Dataset Splits | No | The paper mentions using validation data for comparison, but it does not provide specific details on the dataset split percentages, sample counts, or the methodology used to create these splits. |
| Hardware Specification | No | The paper states: "We would like to thank Genome DK and Aarhus University for providing computational resources that contributed to these research results." This statement is too general and does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts). |
| Software Dependencies | No | The paper mentions "PyTorch Lightning" in a citation but does not specify a version number. No other specific software with version numbers is provided. |
| Experiment Setup | Yes | The model is initialized randomly at each step and trained according to the above with either estimated optimal parameters, ˆα(τ), or fixed α for all steps. We use the network weights from the last iteration of training at each distillation step for the next step, irrespective of whether a better model occurred earlier in the training. Our models are trained for a fixed 75 epochs and each experiment is repeated with 4 different random seeds over 11 chains of distillation steps. |
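
The following is a minimal sketch of Algorithm 1 quoted above, written under the assumption that the paper's equation (3) can be stood in for by an ordinary ridge-regression fit. The `fit_ridge` helper, the feature matrix `X`, the regularization strength `lam`, and the interpretation of "from (3) with $\alpha^{*(t)}$" as refitting on $\alpha$-mixed targets are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch of Algorithm 1: estimate alpha*(t) at each self-distillation step.
# Ridge regression is an assumed stand-in for the paper's equation (3).
import numpy as np

def fit_ridge(X, targets, lam=1.0):
    """Hypothetical stand-in for equation (3): closed-form ridge solution."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ targets)

def self_distill_with_alpha(X, y, tau, lam=1.0):
    """Run tau distillation steps, estimating alpha*(t) for every t >= 2."""
    # Step 1: fit on the ground-truth targets y and record predictions y_hat_1.
    beta = fit_ridge(X, y, lam)
    y_hat_1 = X @ beta
    y_prev = y_hat_1
    alphas = []
    for t in range(2, tau + 1):
        # Pure distillation fit (alpha = 0): train only on previous predictions.
        beta_a0 = fit_ridge(X, y_prev, lam)
        y_hat_a0 = X @ beta_a0
        # alpha*(t) = argmin_a || y - (a * y_hat_1 + (1 - a) * y_hat_a0) ||^2,
        # a quadratic in a with a closed-form minimizer.
        d = y_hat_1 - y_hat_a0
        r = y - y_hat_a0
        alpha = float(r @ d) / float(d @ d)
        alphas.append(alpha)
        # Refit with targets mixing ground truth and previous predictions
        # (assumed reading of "calculate beta(t) from (3) with alpha*(t)").
        mixed_targets = alpha * y + (1.0 - alpha) * y_prev
        beta = fit_ridge(X, mixed_targets, lam)
        y_prev = X @ beta
    return beta, alphas
```

The closed-form step uses the fact that with $d = \hat{y}^{(1)} - \hat{y}^{(t)}_{\alpha=0}$ and $r = y - \hat{y}^{(t)}_{\alpha=0}$, the objective becomes $\lVert r - \alpha d \rVert^2$, whose minimizer is $\alpha^{*} = \langle r, d\rangle / \lVert d \rVert^2$.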
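The experiment-setup quote describes chains of self-distillation steps in which each student is freshly initialized, trained for 75 epochs, and then used (with its last-iteration weights) as the teacher for the next step. The sketch below illustrates that loop under assumed choices: the cross-entropy/KL loss mixing with weight `alpha`, the SGD optimizer, and the learning rate are hypothetical; the authors' exact configuration is in the linked repository (github.com/Kennethborup/self_distillation).

```python
# Illustrative sketch of a self-distillation chain for ResNet-50 on CIFAR-10.
# Loss weighting, optimizer, and hyperparameters are assumptions for clarity.
import copy
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

def train_distillation_chain(train_loader, num_steps=11, epochs=75,
                             alpha=0.5, device="cuda"):
    """Train a chain of distillation steps; each student starts from scratch."""
    teacher = None
    for step in range(num_steps):
        # Random initialization at every distillation step, as stated in the paper.
        student = resnet50(num_classes=10).to(device)
        optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)  # assumed
        for _ in range(epochs):
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                logits = student(x)
                loss = F.cross_entropy(logits, y)
                if teacher is not None:
                    with torch.no_grad():
                        teacher_logits = teacher(x)
                    # Mix the ground-truth loss with a distillation loss using weight
                    # alpha (alpha could also be a fixed value or an estimated alpha-hat).
                    distill = F.kl_div(F.log_softmax(logits, dim=1),
                                       F.softmax(teacher_logits, dim=1),
                                       reduction="batchmean")
                    loss = alpha * loss + (1.0 - alpha) * distill
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        # The last-iteration weights become the teacher for the next step,
        # irrespective of whether a better model occurred earlier in training.
        teacher = copy.deepcopy(student).eval()
    return student
```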