Self-Distillation Amplifies Regularization in Hilbert Space
Authors: Hossein Mobahi, Mehrdad Farajtabar, Peter Bartlett
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is a Hilbert space and fitting is subject to ℓ2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance. ... In our experiments, we aim to empirically evaluate our theoretical analysis in the setting of deep networks. ... Both of these phenomena are observed in the four left plots of Figure 3. (A kernel-regression sketch of this basis-limiting iteration appears after the table.) |
| Researcher Affiliation | Collaboration | Hossein Mobahi (hmobahi@google.com), Google Research, Mountain View, CA, USA; Mehrdad Farajtabar (farajtabar@google.com), DeepMind, Mountain View, CA, USA; Peter L. Bartlett (bartlett@eecs.berkeley.edu), EECS Dept., University of California at Berkeley, Berkeley, CA, USA |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Full proofs for these as well as the code for reproducing examples in Section 4 and results in Section 5 are available in the supplementary appendix. |
| Open Datasets | Yes | We use Resnet [12] and VGG [30] neural architectures and train them on CIFAR-10 and CIFAR-100 datasets [18]. |
| Dataset Splits | No | The paper mentions CIFAR-10 and CIFAR-100 but does not specify explicit train/validation/test splits within the main text. It states 'Training details and additional results are left to the appendix.' |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models used for experiments. |
| Software Dependencies | No | The paper does not specify version numbers for any software dependencies, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | No | The paper generally mentions training with 'Resnet' and 'VGG' architectures using 'ℓ2 loss', 'cross-entropy loss', and 'randomly initialized weights'. However, specific hyperparameter values (e.g., learning rate, batch size, number of epochs) and detailed training configurations are not provided in the main text, which states 'Training details and additional results are left to the appendix.' (A hedged training-loop sketch of the described self-distillation protocol follows the table.) |
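
The basis-limiting behavior quoted in the Research Type row can be illustrated numerically. The following is a minimal sketch, not the authors' released code: it iterates kernel ridge regression, feeding each round's fitted values back in as the next round's targets, which is the Hilbert-space picture of self-distillation the paper analyzes. The RBF kernel, bandwidth `gamma`, regularization strength `c`, and the threshold used to count "active" basis directions are all assumptions chosen for illustration.

```python
# Minimal sketch (not from the paper's supplementary code): iterated kernel
# ridge regression as a proxy for self-distillation in an RKHS. Each round
# refits the l2-regularized model to the previous round's predictions, which
# shrinks the contribution of small-eigenvalue basis directions.
import numpy as np

def rbf_kernel(X, Y, gamma=10.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, size=(30, 1)), axis=0)
y = np.sin(4 * np.pi * X[:, 0]) + 0.3 * rng.normal(size=30)  # noisy 1-D targets

K = rbf_kernel(X, X)
c = 1e-2                                        # assumed l2 regularization strength
A = K @ np.linalg.inv(K + c * np.eye(len(X)))   # maps targets -> fitted values on the sample

eigvals, eigvecs = np.linalg.eigh(K)
targets = y.copy()
for t in range(6):                              # rounds of self-distillation
    fitted = A @ targets                        # regularized fit to the current targets
    coeffs = eigvecs.T @ fitted                 # expansion in the kernel's eigenbasis
    n_active = int((np.abs(coeffs) > 1e-3 * np.abs(coeffs).max()).sum())
    print(f"round {t}: train MSE = {np.mean((fitted - y) ** 2):.4f}, "
          f"active basis directions = {n_active}")
    targets = fitted                            # next round regresses on its own predictions
```

Running this prints a training MSE that grows and an "active basis directions" count that shrinks across rounds, mirroring the paper's claim that early rounds regularize while later rounds under-fit.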
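
For the Experiment Setup row, the main text only outlines the deep-network protocol (train from random initialization, then distill each later round from the previous round's outputs). The sketch below is a hedged reconstruction of such a loop, assuming a PyTorch/torchvision setup; the ResNet-18 architecture, optimizer, learning rate, batch size, epoch count, and the use of an MSE loss on teacher outputs are placeholders, not values reported in the paper's main text.

```python
# Hedged sketch of a self-distillation chain on CIFAR-10: round 0 fits the
# ground-truth labels with cross-entropy; each later round trains a freshly
# initialized network to match the previous round's outputs with an l2 (MSE)
# loss. All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = torchvision.datasets.CIFAR10(
    "data", train=True, download=True, transform=T.ToTensor())
loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

def fresh_model():
    # Placeholder architecture: torchvision ResNet-18 with 10 output classes.
    return torchvision.models.resnet18(num_classes=10).to(device)

def train_one_round(teacher=None, epochs=1):
    """Train a freshly initialized student; distill from `teacher` if given."""
    student = fresh_model()
    opt = torch.optim.SGD(student.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    student.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            out = student(x)
            if teacher is None:
                loss = ce(out, y)            # round 0: fit ground-truth labels
            else:
                with torch.no_grad():
                    target = teacher(x)      # previous round's predictions
                loss = mse(out, target)      # later rounds: l2 loss on teacher outputs
            opt.zero_grad()
            loss.backward()
            opt.step()
    student.eval()
    return student

# Chain several rounds of self-distillation.
model = None
for round_idx in range(4):
    model = train_one_round(teacher=model)
```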