Random Teachers are Good Teachers
Authors: Felix Sarnthein, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To isolate its effect, we describe a simple experiment where we consider teachers at random initialization instead of trained teachers. Surprisingly, when distilling a student into such a random teacher, we observe that the resulting model and its representations already possess very interesting characteristics: (1) we observe a strong improvement of the distilled student over its teacher in terms of probing accuracy. (2) The learned representations are data-dependent and transferable between different tasks but deteriorate strongly if trained on random inputs. |
| Researcher Affiliation | Academia | Department of Computer Science, ETH Zürich, Switzerland. |
| Pseudocode | Yes | Let us summarize the method in pseudo-code: (an illustrative training-loop sketch follows the table) |
| Open Source Code | Yes | Code is available at www.github.com/safelix/dinopl. |
| Open Datasets | Yes | Unless otherwise stated, we perform probing on the CIFAR10 dataset (Krizhevsky & Hinton, 2009) and aggregate mean and standard deviation over three runs. We expand our experimental setup to more datasets, including CIFAR100 (Krizhevsky & Hinton, 2009), STL10 (Coates et al., 2011) and Tiny ImageNet (Le & Yang, 2015). (An illustrative probing sketch follows the table.) |
| Dataset Splits | No | The paper mentions training and test sets but does not explicitly detail a separate validation set for model training or hyperparameter tuning. |
| Hardware Specification | No | The paper does not specify the hardware used for running the experiments. |
| Software Dependencies | No | The paper mentions various software components like PyTorch, Adam, SGD, and torchvision, but does not specify their version numbers. |
| Experiment Setup | Yes | We minimize the objective (1) with the Adam optimizer (Kingma & Ba, 2014) using a learning rate η = 0.001. ... We follow the protocol of non-contrastive learning and initialize the student closely to the teacher. ... To that end, we consider initializations of the form $\theta_S = \frac{(1-\alpha)\,\theta_T + \alpha\,\theta_{\text{INIT}}}{\delta}$, where $\theta_{\text{INIT}}$ is a fresh initialization, $\alpha \in [0, 1]$, and $\delta = \sqrt{\alpha^2 + (1-\alpha)^2}$ ensures that the variance remains constant for all $\alpha \in [0, 1]$ (see the initialization sketch after the table). ... Appendix E, Experimental Details: Encoder: ResNet18 & VGG11 from torchvision, without fc or classification layers (embedding in $\mathbb{R}^{512}$); ResNet18 stem adjusted for CIFAR (conv from 7x7 to 3x3, maxpool removed). Projection head: 3-layer MLP, 512 → 2048 → 2048 → ℓ2-bottleneck(256) → $2^{16}$. ... Training: batch size 64 per GPU; 256 ... Optimizer: AdamW ... Learning rate: 0.001 (torch default). Weight decay: 0.04 → 0.4 linear schedule (not applied). Gradient clipping to norm 3 (not applied). Freezing of last layer during first epoch (not applied). |
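
To make the interpolated student initialization quoted in the "Experiment Setup" row concrete, here is a minimal PyTorch sketch. The function name `interpolate_init` and the usage line are illustrative assumptions, not taken from the paper's released code.

```python
import math
import torch.nn as nn


def interpolate_init(teacher: nn.Module, fresh: nn.Module, alpha: float) -> dict:
    """Blend teacher weights with a fresh initialization.

    Computes theta_S = ((1 - alpha) * theta_T + alpha * theta_INIT) / delta,
    where delta = sqrt(alpha**2 + (1 - alpha)**2) keeps the parameter variance
    constant for every alpha in [0, 1] (assuming independent initializations).
    """
    delta = math.sqrt(alpha ** 2 + (1 - alpha) ** 2)
    theta_t = dict(teacher.named_parameters())
    theta_i = dict(fresh.named_parameters())
    return {
        name: ((1 - alpha) * theta_t[name] + alpha * theta_i[name]) / delta
        for name in theta_t
    }


# Hypothetical usage: copy the blended parameters into the student network.
# student.load_state_dict(interpolate_init(teacher, fresh_init, alpha=0.5), strict=False)
```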
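
The overall training setup quoted above (a frozen teacher at random initialization, a student initialized at or close to the teacher, trained with Adam at learning rate 0.001) could look roughly like the sketch below. This is a simplified stand-in, not the authors' implementation: the projection head omits the ℓ2-bottleneck and the $2^{16}$ output, the CIFAR stem change is skipped, and a plain MSE loss replaces the DINO-style objective the paper builds on.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


def build_model() -> nn.Module:
    # ResNet18 encoder without its classification layer (512-d embedding),
    # followed by a simplified projection head.
    backbone = resnet18()
    backbone.fc = nn.Identity()
    head = nn.Sequential(
        nn.Linear(512, 2048), nn.GELU(),
        nn.Linear(2048, 2048), nn.GELU(),
        nn.Linear(2048, 256),
    )
    return nn.Sequential(backbone, head)


teacher = build_model()             # random initialization, never trained
student = copy.deepcopy(teacher)    # student starts identical to the teacher (alpha = 0)
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)


def distill_step(images: torch.Tensor) -> float:
    with torch.no_grad():
        t_out = teacher(images)
    s_out = student(images)
    # Placeholder loss: the paper follows a DINO-style objective (cross-entropy
    # between centered/sharpened teacher and student outputs); MSE is used here
    # only to keep the sketch self-contained.
    loss = F.mse_loss(s_out, t_out)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For the actual objective, schedules, and configuration, see the released code at www.github.com/safelix/dinopl.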
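
Probing accuracy, as measured on CIFAR10 in the "Open Datasets" row, is obtained by fitting a linear classifier on frozen features. The sketch below assumes the 512-dimensional encoder from the previous snippet (e.g. `student[0]`); the dataset root, batch size, optimizer and epoch count are illustrative choices, not the paper's probing protocol.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


def linear_probe(encoder: nn.Module, epochs: int = 10, root: str = "data") -> float:
    """Train a linear classifier on frozen 512-d features, return CIFAR10 test accuracy."""
    tfm = transforms.ToTensor()
    train_set = datasets.CIFAR10(root, train=True, download=True, transform=tfm)
    test_set = datasets.CIFAR10(root, train=False, download=True, transform=tfm)

    encoder.eval()
    probe = nn.Linear(512, 10)
    opt = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)

    for _ in range(epochs):
        for images, labels in DataLoader(train_set, batch_size=256, shuffle=True):
            with torch.no_grad():
                feats = encoder(images)      # frozen representations
            loss = nn.functional.cross_entropy(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()

    correct = total = 0
    with torch.no_grad():
        for images, labels in DataLoader(test_set, batch_size=256):
            preds = probe(encoder(images)).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total


# Hypothetical usage, probing the distilled student's encoder:
# accuracy = linear_probe(student[0])
```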