SLaM: Student-Label Mixing for Distillation with Unlabeled Examples
Authors: Vasilis Kontonis, Fotis Iliopoulos, Khoa Trinh, Cenk Baykal, Gaurav Menghani, Erik Vee
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM) and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. |
| Researcher Affiliation | Collaboration | Vasilis Kontonis (UT Austin, vasilis@cs.utexas.edu); Fotis Iliopoulos (Google Research, fotisi@google.com); Khoa Trinh (Google Research, khoatrinh@google.com); Cenk Baykal (Google Research, baykalc@google.com); Gaurav Menghani (Google Research, gmenghani@google.com); Erik Vee (Google Research, erikvee@google.com) |
| Pseudocode | Yes | In this section we present pseudo-code describing the distillation with unlabeled examples setting and the SLaM method, Algorithm 1. |
| Open Source Code | Yes | Remark B.1. We remark that in our experiments, we observed that not normalizing the mixing operation with k(x) − 1 led to better results overall. Therefore, the mixing operation used in our experimental evaluation of SLaM is mix(f(x; w); α(x), k(x)) = α(x) f(x; w) + (1 − α(x)) (1 − f(x; w)) top(y_s(x); k(x)). For more details we refer the reader to the code provided in the supplementary material. (A code sketch of this mixing operation follows the table.) |
| Open Datasets | Yes | CIFAR-{10, 100} and CelebA: Here we present our results on CIFAR-{10, 100} [30] and CelebA [22]. ImageNet: Here we present the results on ImageNet [49]. Large Movies Reviews Dataset: Here we present results on the Large Movies Reviews Dataset [39]. |
| Dataset Splits | Yes | For each trial we randomly split dataset C into a small (e.g., 500 examples) validation dataset V and an unlabeled training dataset U. (A split sketch follows the table.) |
| Hardware Specification | Yes | We ran our experiments on 64 Cloud TPU v4s each with two cores. |
| Software Dependencies | No | We implemented all algorithms in Python and used the TensorFlow deep learning library [1]. The paper mentions TensorFlow but does not specify a version number for it or for Python. |
| Experiment Setup | Yes | For the experiments on CIFAR-10/100 and CelebA we use the Adam optimizer with initial learning rate lr = 0.001. We then proceed according to the following learning rate schedule... For SLaM we always use 0.5 as the lower bound for isotonic regression (i.e., the parameter lb in Algorithm 2). (An optimizer and isotonic-regression sketch follows the table.) |
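
The mixing operation quoted in the Open Source Code row can be sketched in plain Python/NumPy. This is a minimal illustration, not the authors' code: the variable names, the interpretation of top(y_s(x); k(x)) as a 0/1 indicator of the k(x) largest entries of the teacher label, and the element-wise broadcasting over classes are all assumptions made here.

```python
import numpy as np

def top_k_mask(teacher_label, k):
    # Assumed semantics of top(y_s(x); k(x)): a 0/1 indicator of the k largest
    # entries of the teacher label y_s(x).
    mask = np.zeros_like(teacher_label)
    mask[np.argsort(teacher_label)[-k:]] = 1.0
    return mask

def slam_mix(student_probs, teacher_label, alpha, k, normalize=False):
    # mix(f(x;w); alpha(x), k(x)) = alpha * f + (1 - alpha) * (1 - f) * top(y_s; k),
    # taken element-wise over the class dimension (an assumption; the excerpt
    # above does not spell out the broadcasting).
    second_term = (1.0 - alpha) * (1.0 - student_probs) * top_k_mask(teacher_label, k)
    if normalize:
        # Normalized variant that divides by k - 1; per Remark B.1 the
        # unnormalized form worked better in the authors' experiments.
        second_term = second_term / (k - 1)
    return alpha * student_probs + second_term

# Toy usage with made-up numbers.
f = np.array([0.7, 0.2, 0.1])    # student class probabilities f(x; w)
y_s = np.array([0.5, 0.3, 0.2])  # teacher soft label y_s(x)
print(slam_mix(f, y_s, alpha=0.8, k=2))
```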
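
The per-trial split described in the Dataset Splits row can be illustrated as follows; the 500-example validation size is the paper's example value, while the function name, seed handling, and dataset size below are hypothetical.

```python
import numpy as np

def split_validation_unlabeled(num_examples, val_size=500, seed=0):
    # Randomly partition the indices of dataset C into a small validation set V
    # and an unlabeled training set U, as done independently for each trial.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)
    return perm[:val_size], perm[val_size:]

# Hypothetical usage for a 50,000-example dataset.
val_idx, unlabeled_idx = split_validation_unlabeled(num_examples=50_000, val_size=500)
```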
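
Two concrete pieces of the Experiment Setup row can be written down directly: the TensorFlow Adam optimizer with the stated initial learning rate, and an isotonic-regression estimator with a 0.5 lower bound standing in for the lb parameter of Algorithm 2. The learning-rate schedule is elided in the excerpt, so it is not reproduced, and the use of scikit-learn for isotonic regression is an assumption made for illustration only.

```python
import tensorflow as tf
from sklearn.isotonic import IsotonicRegression

# Adam optimizer with the stated initial learning rate lr = 0.001; the paper's
# learning-rate schedule is elided above and therefore omitted here.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Isotonic regression with a 0.5 lower bound on the fitted values, mirroring
# the lb parameter of Algorithm 2 (scikit-learn is an assumed implementation).
iso = IsotonicRegression(y_min=0.5, y_max=1.0, increasing=True, out_of_bounds="clip")
```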