Unifying distillation and privileged information

Authors: David Lopez-Paz, Léon Bottou, Bernhard Schölkopf, Vladimir Vapnik

ICLR 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical and causal insight about the inner workings of generalized distillation, extend it to unsupervised, semisupervised and multitask learning scenarios, and illustrate its efficacy on a variety of numerical simulations on both synthetic and real-world data. [Section 5, Numerical Simulations] We now present some experiments to illustrate when the distillation of privileged information is effective, and when it is not. The necessary Python code to replicate all the following experiments is available at http://github.com/lopezpaz.
Researcher Affiliation | Collaboration | David Lopez-Paz, Facebook AI Research, Paris, France, dlp@fb.com; Léon Bottou, Facebook AI Research, New York, USA, leon@bottou.org; Bernhard Schölkopf, Max Planck Institute for Intelligent Systems, Tübingen, Germany, bs@tuebingen.mpg.de; Vladimir Vapnik, Facebook AI Research and Columbia University, New York, USA, vladimir.vapnik@gmail.com
Pseudocode | No | Then, the process of generalized distillation is as follows: 1. Learn teacher f_t ∈ F_t using the input-output pairs {(x̃_i, y_i)}_{i=1}^n and Eq. 3. 2. Compute teacher soft labels {σ(f_t(x̃_i)/T)}_{i=1}^n, using temperature parameter T > 0. 3. Learn student f_s ∈ F_s using the input-output pairs {(x_i, y_i)}_{i=1}^n, {(x_i, s_i)}_{i=1}^n, Eq. 4, and imitation parameter λ ∈ [0, 1]. [A runnable sketch of these three steps follows the table.]
Open Source Code | Yes | The necessary Python code to replicate all the following experiments is available at http://github.com/lopezpaz.
Open Datasets | Yes | 5. MNIST handwritten digit image classification: The privileged features are the original 28x28 pixels MNIST handwritten digit images (LeCun et al., 1998b). 6. Semisupervised learning: We explore the semisupervised capabilities of generalized distillation on the CIFAR10 dataset (Krizhevsky, 2009). 7. Multitask learning: The SARCOS dataset (Vijayakumar, 2000).
Dataset Splits | No | We use 300 or 500 samples to train both the teacher and the student, and test their accuracies at multiple levels of temperature and imitation on the full test set.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or other computing specifications used for running the experiments.
Software Dependencies | No | The necessary Python code to replicate all the following experiments is available at http://github.com/lopezpaz.
Experiment Setup | Yes | Both student and teacher are neural networks composed of two hidden layers of 20 rectifier linear units and a softmax output layer (the same networks are used in the remaining experiments). The temperature parameter T > 0 controls how much we want to soften or smooth the class-probability predictions from f_t, and the imitation parameter λ ∈ [0, 1] balances the importance between imitating the soft predictions s_i and predicting the true hard labels y_i. [...] distilling the teacher explanations into the student classifier with λ = T = 1. [See the temperature illustration following the table.]
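
The three-step procedure quoted in the Pseudocode row can be made concrete as follows. This is a minimal sketch, assuming PyTorch and synthetic toy data; the authors' released Python code at http://github.com/lopezpaz is the reference implementation, and the helper names (make_net, fit), the toy data, and the hyperparameters below are purely illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_net(d_in, n_classes):
        # Two hidden layers of 20 rectifier linear units, as in the paper's
        # experiment setup; the softmax is applied inside the losses below.
        return nn.Sequential(nn.Linear(d_in, 20), nn.ReLU(),
                             nn.Linear(20, 20), nn.ReLU(),
                             nn.Linear(20, n_classes))

    def fit(net, inputs, loss_fn, epochs=200, lr=1e-2):
        # Generic full-batch training loop (illustrative, not from the paper).
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(net(inputs))
            loss.backward()
            opt.step()
        return net

    # Toy data: x_priv plays the role of the privileged representation x~_i,
    # x the regular representation x_i, y the hard labels y_i.
    n, d_priv, d, n_classes = 200, 5, 50, 2
    x_priv = torch.randn(n, d_priv)
    x = torch.cat([x_priv + 0.5 * torch.randn(n, d_priv),
                   torch.randn(n, d - d_priv)], dim=1)
    y = (x_priv[:, 0] > 0).long()

    T, lam = 1.0, 1.0   # temperature and imitation parameter

    # Step 1: learn the teacher f_t from the privileged pairs (x~_i, y_i).
    teacher = fit(make_net(d_priv, n_classes), x_priv,
                  lambda logits: F.cross_entropy(logits, y))

    # Step 2: compute the teacher's soft labels s_i = sigma(f_t(x~_i) / T).
    with torch.no_grad():
        s = F.softmax(teacher(x_priv) / T, dim=1)

    # Step 3: learn the student f_s from (x_i, y_i) and (x_i, s_i),
    # balancing the two objectives with the imitation parameter lambda.
    def student_loss(logits):
        hard = F.cross_entropy(logits, y)                            # predict the true hard labels
        soft = -(s * F.log_softmax(logits, dim=1)).sum(1).mean()     # imitate the soft labels s_i
        return (1 - lam) * hard + lam * soft

    student = fit(make_net(d, n_classes), x, student_loss)

With lam = 1 the student ignores the hard labels and learns only from the teacher's soft predictions, which matches the λ = T = 1 configuration quoted in the Experiment Setup row.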
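
To illustrate the role of the temperature described in the Experiment Setup row, the short snippet below (same assumptions as above; the logits are made up and not taken from the paper) shows how dividing the teacher's scores by T before the softmax smooths the resulting class probabilities.

    import torch
    import torch.nn.functional as F

    logits = torch.tensor([4.0, 1.0, 0.1])   # made-up teacher scores f_t(x~) for three classes
    for T in (1.0, 2.0, 5.0):
        print(T, F.softmax(logits / T, dim=0))
    # Larger T flattens the distribution, exposing more of the teacher's
    # information about the non-argmax classes; the imitation parameter
    # lambda then weights this soft-label term against the true hard labels.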