What Knowledge Gets Distilled in Knowledge Distillation?

Authors: Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work presents a comprehensive study to try to answer these questions. We show that existing methods can indeed indirectly distill these properties beyond improving task performance. We further study why knowledge distillation might work this way, and show that our findings have practical implications as well. We now discuss our experimental design. To reach conclusions that are generalizable across different architectures and datasets, and robust across independent runs, we experiment with a variety of teacher-student architectures, and tune the hyperparameters so that the distillation objective improves the test performance of the student compared to independent training.
Researcher Affiliation | Academia | Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee (University of Wisconsin-Madison)
Pseudocode | No | The paper describes algorithms using mathematical equations, but there are no pseudocode blocks or clearly labeled algorithm sections.
Open Source Code | No | The paper does not include a statement about releasing code for its methodology or provide a link to a code repository.
Open Datasets | Yes | We train three models for ImageNet classification (Sec 4.1). We use MNIST digit recognition (Sec 5.1). FairFace dataset [15] (Sec 5.2). CIFAR100 [16], VLCS and PACS [18] (Sec 4.5). These are well-known public datasets and are cited.
Dataset Splits | Yes | We evaluate the models on 50k ImageNet validation images (Sec 4.3). From its training split, we create two different subsets (Ds and Dt) with the following objectives: (i) Ds has a particular racial composition so that a model trained on it will perform fairly across all races during test time; (ii) Dt's composition is intended to make the model perform unfairly for certain races. The exact composition of Ds and Dt is given in the appendix (Sec 5.2). This indicates specific data splitting for the FairFace dataset (see the subset sketch after the table).
Hardware Specification | No | The paper discusses model architectures and datasets but does not specify any particular hardware components (e.g., specific GPU or CPU models, memory details) used for running the experiments.
Software Dependencies | No | The paper mentions using 'Grad-CAM [27]' but does not provide specific version numbers for any software, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | The overall loss function is γ·L_CLS + α·L_KL, where γ and α are balancing parameters (Sec 3.1). The overall loss is γ·L_CLS + β·L_Hint, where γ and β are balancing parameters (Sec 3.2). All other implementation details (e.g., temperature for KL, layer index for Hint) can be found in the appendix (Sec 3.3). Specifically, we alter an image's brightness, contrast, saturation, and hue, with magnitudes sampled uniformly in [0, 0.4] for the first three, and in [0, 0.2] for hue (Sec 4.3). These details provide specific hyperparameter values and training configurations (see the loss and augmentation sketches after the table).
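For the Ds/Dt split described in the Dataset Splits row, a minimal sketch of how one might draw subsets of the FairFace training split with a prescribed racial composition is shown below. The helper name, the label format, and every proportion here are illustrative assumptions; the paper only states that the exact compositions of Ds and Dt are given in its appendix.

```python
import random
from collections import defaultdict

def sample_subset(samples, composition, subset_size, seed=0):
    """Draw a subset whose per-race proportions follow `composition`.

    `samples` is a list of (image_path, race_label) pairs; the actual
    compositions of Ds and Dt are given only in the paper's appendix,
    so any numbers passed here are placeholders.
    """
    rng = random.Random(seed)
    by_race = defaultdict(list)
    for item in samples:
        by_race[item[1]].append(item)
    subset = []
    for race, fraction in composition.items():
        pool = by_race[race]
        k = min(int(fraction * subset_size), len(pool))
        subset.extend(rng.sample(pool, k))
    rng.shuffle(subset)
    return subset

# Hypothetical usage: Ds aims for a balanced composition, Dt a skewed one.
# d_s = sample_subset(fairface_train,
#                     {"White": 0.25, "Black": 0.25, "East Asian": 0.25, "Indian": 0.25},
#                     subset_size=20000)
```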
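As a rough illustration of the two objectives quoted in the Experiment Setup row, γ·L_CLS + α·L_KL (Sec 3.1) and γ·L_CLS + β·L_Hint (Sec 3.2), here is a minimal PyTorch-style sketch. The temperature, the weights γ, α, β, the choice of hint layer, and the MSE-based hint loss with a learned regressor are assumptions, since the paper defers those details to its appendix.

```python
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between softened teacher and student distributions.
    The temperature value is an assumption; the paper gives it in the appendix."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def response_distillation_loss(student_logits, teacher_logits, labels,
                               gamma=1.0, alpha=1.0, temperature=4.0):
    """gamma * L_CLS + alpha * L_KL (Sec 3.1); gamma/alpha values are placeholders."""
    l_cls = F.cross_entropy(student_logits, labels)
    l_kl = kd_kl_loss(student_logits, teacher_logits, temperature)
    return gamma * l_cls + alpha * l_kl

def hint_distillation_loss(student_logits, labels, student_feat, teacher_feat,
                           regressor, gamma=1.0, beta=1.0):
    """gamma * L_CLS + beta * L_Hint (Sec 3.2): match an intermediate student
    feature map to the teacher's, through a regressor if shapes differ."""
    l_cls = F.cross_entropy(student_logits, labels)
    l_hint = F.mse_loss(regressor(student_feat), teacher_feat)
    return gamma * l_cls + beta * l_hint
```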
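The color-jitter magnitudes quoted from Sec 4.3 map naturally onto torchvision's ColorJitter transform; the sketch below assumes that mapping, which the excerpt itself does not confirm.

```python
from torchvision import transforms

# Brightness, contrast, and saturation magnitudes sampled uniformly in [0, 0.4],
# hue in [0, 0.2], as quoted from Sec 4.3 of the paper.
color_jitter = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                      saturation=0.4, hue=0.2)
```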