SLaM: Student-Label Mixing for Distillation with Unlabeled Examples

Authors: Vasilis Kontonis, Fotis Iliopoulos, Khoa Trinh, Cenk Baykal, Gaurav Menghani, Erik Vee

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM), and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks.
Researcher Affiliation | Collaboration | Vasilis Kontonis (UT Austin, vasilis@cs.utexas.edu); Fotis Iliopoulos (Google Research, fotisi@google.com); Khoa Trinh (Google Research, khoatrinh@google.com); Cenk Baykal (Google Research, baykalc@google.com); Gaurav Menghani (Google Research, gmenghani@google.com); Erik Vee (Google Research, erikvee@google.com)
Pseudocode | Yes | In this section we present pseudo-code describing the distillation with unlabeled examples setting and the SLaM method, Algorithm 1.
Open Source Code | Yes | Remark B.1. We remark that in our experiments, we observed that not normalizing the mixing operation with k(x) − 1 resulted in better results overall. Therefore, the mixing operation used in our experimental evaluation of SLaM is mix(f(x; w); α(x), k(x)) = α(x) f(x; w) + (1 − α(x)) (1 − f(x; w)) top(y_s(x); k(x)). For more details we refer the reader to the code provided in the supplementary material. (A code sketch of this unnormalized mixing operation follows the table.)
Open Datasets | Yes | CIFAR-{10, 100} and CelebA: Here we present our results on CIFAR-{10, 100} [30] and CelebA [22]. ImageNet: Here we present the results on ImageNet [49]. Large Movie Reviews Dataset: Here we present results on the Large Movie Reviews Dataset [39].
Dataset Splits | Yes | For each trial we randomly split dataset C into a small (e.g., 500 examples) validation dataset V and an unlabeled training dataset U. (A sketch of this split follows the table.)
Hardware Specification | Yes | We ran our experiments on 64 Cloud TPU v4s, each with two cores.
Software Dependencies | No | We implemented all algorithms in Python and used the TensorFlow deep learning library [1]. The paper mentions TensorFlow but does not specify a version number for it or for Python.
Experiment Setup | Yes | For the experiments on CIFAR-10/100 and CelebA we use the Adam optimizer with initial learning rate lr = 0.001. We then proceed according to the following learning rate schedule... For SLaM we always use 0.5 as the lower bound for isotonic regression (i.e., the parameter lb in Algorithm 2). (A sketch of this setup follows the table.)
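
To illustrate the mixing operation quoted in the Open Source Code row, the sketch below implements the unnormalized mix from Remark B.1 in TensorFlow. The names top_k_labels, slam_mix, student_probs, and teacher_probs are our own, and reading top(y_s(x); k(x)) as "keep the k largest teacher entries", with a single batch-wide k, is an assumption; the code released with the paper is the authoritative reference.

```python
import tensorflow as tf


def top_k_labels(teacher_probs, k):
    """Zero out all but the k largest entries of each row of teacher_probs.

    Assumed reading of the paper's top(y_s(x); k(x)) operator; a single
    batch-wide k is used for simplicity, whereas the paper's k(x) is
    per-example.
    """
    kth_largest = tf.math.top_k(teacher_probs, k=k).values[:, -1:]
    return tf.where(teacher_probs >= kth_largest,
                    teacher_probs,
                    tf.zeros_like(teacher_probs))


def slam_mix(student_probs, teacher_probs, alpha, k):
    """Unnormalized student-label mixing from Remark B.1 (sketch).

    mix(f; alpha, k) = alpha * f + (1 - alpha) * (1 - f) * top(y_s; k)

    student_probs ~ f(x; w): [batch, classes] student predictions,
    teacher_probs ~ y_s(x):  [batch, classes] teacher pseudo-labels,
    alpha ~ alpha(x):        [batch, 1] per-example mixing weights.
    """
    return (alpha * student_probs
            + (1.0 - alpha) * (1.0 - student_probs)
            * top_k_labels(teacher_probs, k))
```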
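The split described in the Dataset Splits row can be reproduced along the following lines; the function name, the fixed seed, and the use of NumPy are illustrative choices rather than details from the paper, and 500 is the example validation size quoted above.

```python
import numpy as np


def split_clean_dataset(num_examples, val_size=500, seed=0):
    """Randomly split the indices of the clean dataset C into a small
    validation set V and an unlabeled training set U (sketch)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)
    return perm[:val_size], perm[val_size:]  # indices of V, indices of U
```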
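The Experiment Setup row suggests roughly the configuration sketched below. Using Keras's Adam and scikit-learn's IsotonicRegression is an assumption about the implementation; the learning-rate schedule elided in the quote and the inputs to the isotonic fit are not reconstructed here.

```python
import tensorflow as tf
from sklearn.isotonic import IsotonicRegression

# Adam with the quoted initial learning rate; the rest of the schedule is
# elided in the quote above and is not reconstructed here.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Isotonic regression with lower bound 0.5 (the parameter lb in Algorithm 2).
# Mapping lb to scikit-learn's y_min is an assumption; the fit's inputs and
# targets are omitted in this sketch.
alpha_model = IsotonicRegression(y_min=0.5, increasing=True, out_of_bounds="clip")
```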