Random Teachers are Good Teachers

Authors: Felix Sarnthein, Gregor Bachmann, Sotiris Anagnostidis, Thomas Hofmann

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To isolate its effect, we describe a simple experiment where we consider teachers at random initialization instead of trained teachers. Surprisingly, when distilling a student into such a random teacher, we observe that the resulting model and its representations already possess very interesting characteristics: (1) we observe a strong improvement of the distilled student over its teacher in terms of probing accuracy. (2) The learned representations are data-dependent and transferable between different tasks but deteriorate strongly if trained on random inputs.
Researcher Affiliation | Academia | Department of Computer Science, ETH Zürich, Switzerland.
Pseudocode | Yes | Let us summarize the method in pseudo-code: ... (hedged code sketches of the method follow the table below).
Open Source Code | Yes | Code is available at www.github.com/safelix/dinopl.
Open Datasets | Yes | Unless otherwise stated, we perform probing on the CIFAR10 dataset (Krizhevsky & Hinton, 2009) and aggregate mean and standard deviation over three runs. We expand our experimental setup to more datasets, including CIFAR100 (Krizhevsky & Hinton, 2009), STL10 (Coates et al., 2011) and Tiny ImageNet (Le & Yang, 2015).
Dataset Splits | No | The paper mentions training and test sets but does not explicitly detail a separate validation set for model training or hyperparameter tuning.
Hardware Specification | No | The paper does not specify the hardware used for running the experiments.
Software Dependencies | No | The paper mentions various software components such as PyTorch, Adam, SGD, and torchvision, but does not specify their version numbers.
Experiment Setup | Yes | We minimize the objective (1) with the ADAM optimizer (Kingma & Ba, 2014) using a learning rate η = 0.001. ... We follow the protocol of non-contrastive learning and initialize the student closely to the teacher. ... To that end, we consider initializations of the form ((1 − α)θ_T + α θ_INIT)/δ, where θ_INIT is a fresh initialization, α ∈ [0, 1], and δ = √(α² + (1 − α)²) ensures that the variance remains constant for all α ∈ [0, 1]. ... Appendix E, Experimental Details: Encoder: ResNet18 & VGG11 from torchvision, without fc or classification layers (embedding in R^512); ResNet18 stem adjusted for CIFAR (7x7 conv replaced by 3x3, maxpool removed). Projection head: 3-layer MLP, 512 → 2048 → 2048 → l2-bottleneck(256) → 2^16. ... Training batch size: 64 per GPU / 256. ... Optimizer: AdamW. ... Learning rate: 0.001 (torch default). Weight decay: 0.04 lin. → 0.4 schedule, not applied. Gradient clipping to norm 3: not applied. Freezing of the last layer during the first epoch: not applied.
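The rows above describe distilling a student into a frozen, randomly initialized teacher. Below is a minimal PyTorch sketch of that idea, assuming a DINO-style soft-target cross-entropy; the module layout, temperatures, reduced output dimension, and the omission of the l2-bottleneck, output centering, and data augmentations are simplifying assumptions, not the authors' exact pipeline (see the linked dinopl repository for that).

```python
# Minimal sketch of random-teacher distillation, assuming a DINO-style setup.
import copy
import torch
import torch.nn.functional as F
from torch import nn
from torchvision.models import resnet18

def build_model(out_dim=4096):
    # The quoted setup ends in a much larger head output (2^16) with an
    # l2-bottleneck; both are reduced/omitted here for brevity.
    backbone = resnet18(num_classes=512)          # stand-in 512-d embedding
    head = nn.Sequential(
        nn.Linear(512, 2048), nn.GELU(),
        nn.Linear(2048, 2048), nn.GELU(),
        nn.Linear(2048, out_dim),
    )
    return nn.Sequential(backbone, head)

teacher = build_model()                            # random init, never trained
student = copy.deepcopy(teacher)                   # student starts at the teacher
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(x, temp_t=0.04, temp_s=0.1):
    """One step: match the student's outputs to the frozen random teacher."""
    with torch.no_grad():
        targets = F.softmax(teacher(x) / temp_t, dim=-1)
    log_preds = F.log_softmax(student(x) / temp_s, dim=-1)
    loss = -(targets * log_preds).sum(dim=-1).mean()   # cross-entropy to soft targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```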
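The Experiment Setup row quotes student initializations of the form ((1 − α)θ_T + α θ_INIT)/δ. The helper below sketches that interpolation over model parameters; the function name and the parameter-wise loop are assumptions, and non-parameter buffers (e.g., batch-norm statistics) are simply copied from the teacher.

```python
import copy
import math
import torch

@torch.no_grad()
def interpolate_init(teacher, fresh, alpha):
    """Student init between the teacher's weights and a fresh initialization.

    Implements ((1 - alpha) * theta_T + alpha * theta_INIT) / delta with
    delta = sqrt(alpha^2 + (1 - alpha)^2), which keeps the parameter variance
    constant for alpha in [0, 1] (assuming independent initializations).
    """
    delta = math.sqrt(alpha ** 2 + (1 - alpha) ** 2)
    student = copy.deepcopy(teacher)
    for p_s, p_t, p_f in zip(student.parameters(),
                             teacher.parameters(),
                             fresh.parameters()):
        p_s.copy_(((1 - alpha) * p_t + alpha * p_f) / delta)
    return student

# Usage (any two identically shaped nn.Modules work):
#   student = interpolate_init(teacher, fresh_model, alpha=0.5)
```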
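Probing accuracy, the metric cited in the Research Type and Open Datasets rows, is typically measured by fitting a linear classifier on frozen features. A simplified sketch, assuming full-batch training of the probe and hypothetical helper names:

```python
import torch
import torch.nn.functional as F
from torch import nn

@torch.no_grad()
def extract_features(encoder, loader, device="cpu"):
    """Run the frozen encoder over a dataloader and collect features/labels."""
    encoder.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        feats.append(encoder(x.to(device)).flatten(1).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def fit_linear_probe(feats, labels, num_classes=10, epochs=100, lr=1e-3):
    """Train a linear classifier on frozen features (full-batch for simplicity)."""
    probe = nn.Linear(feats.size(1), num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = F.cross_entropy(probe(feats), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

# Probing accuracy on a held-out split:
#   probe = fit_linear_probe(train_feats, train_labels)
#   acc = (probe(test_feats).argmax(dim=1) == test_labels).float().mean()
```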