What Knowledge Gets Distilled in Knowledge Distillation?
Authors: Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work presents a comprehensive study to try to answer these questions. We show that existing methods can indeed indirectly distill these properties beyond improving task performance. We further study why knowledge distillation might work this way, and show that our findings have practical implications as well. We now discuss our experimental design. To reach conclusions that are generalizable across different architectures and datasets, and robust across independent runs, we experiment with a variety of teacher-student architectures, and tune the hyperparameters so that the distillation objective improves the test performance of the student compared to independent training. |
| Researcher Affiliation | Academia | Utkarsh Ojha, Yuheng Li, Anirudh Sundara Rajan, Yingyu Liang, Yong Jae Lee; University of Wisconsin-Madison |
| Pseudocode | No | The paper describes algorithms using mathematical equations, but there are no pseudocode blocks or clearly labeled algorithm sections. |
| Open Source Code | No | The paper does not include a statement about releasing code for its methodology or provide a link to a code repository. |
| Open Datasets | Yes | We train three models for ImageNet classification (Sec 4.1). We use MNIST digit recognition (Sec 5.1). FairFace dataset [15] (Sec 5.2). CIFAR100 [16], VLCS and PACS [18] (Sec 4.5). These are well-known public datasets and are cited. |
| Dataset Splits | Yes | We evaluate the models on 50k ImageNet validation images (Sec 4.3). From its training split, we create two different subsets (Ds and Dt) with the following objectives: (i) Ds has a particular racial composition so that a model trained on it will perform fairly across all races during test time; (ii) Dt's composition is intended to make the model perform unfairly for certain races. The exact composition of Ds and Dt is given in the appendix. (Sec 5.2). This indicates specific data splitting for the FairFace dataset. |
| Hardware Specification | No | The paper discusses model architectures and datasets but does not specify any particular hardware components (e.g., specific GPU or CPU models, memory details) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Grad-CAM [27]' but does not provide specific version numbers for any software, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | The overall loss function is γ L_CLS + α L_KL, where γ and α are balancing parameters. (Sec 3.1). The overall loss is γ L_CLS + β L_Hint, where γ and β are balancing parameters. (Sec 3.2). All other implementation details (e.g., temperature for KL, layer index for Hint) can be found in appendix. (Sec 3.3). Specifically, we alter an image's brightness, contrast, saturation, and hue, with magnitudes sampled uniformly in [0, 0.4] for the first three, and in [0, 0.2] for hue. (Sec 4.3). These details provide specific hyperparameter values and training configurations; a minimal code sketch of these objectives follows the table. |
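The loss forms and augmentation magnitudes quoted in the Experiment Setup row are concrete enough to sketch. Below is a minimal PyTorch sketch, not the authors' implementation: `kd_loss` and `hint_loss` are hypothetical helper names, the balancing weights, KL temperature, and hint-layer choice are placeholders (the paper defers those values to its appendix), and mapping the stated jitter magnitudes onto torchvision's `ColorJitter` arguments is our assumption.

```python
import torch.nn.functional as F
from torchvision import transforms

# Color-jitter augmentation as described in Sec 4.3: brightness, contrast,
# and saturation magnitudes up to 0.4, hue up to 0.2 (our mapping onto
# torchvision's ColorJitter arguments).
color_jitter = transforms.ColorJitter(
    brightness=0.4, contrast=0.4, saturation=0.4, hue=0.2
)


def kd_loss(student_logits, teacher_logits, labels,
            gamma=1.0, alpha=1.0, temperature=4.0):
    """Response-based distillation loss: gamma * L_CLS + alpha * L_KL.

    gamma, alpha, and temperature are placeholder values; the paper defers
    the exact settings (e.g., the KL temperature) to its appendix.
    """
    # Standard cross-entropy on the ground-truth labels.
    l_cls = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened teacher and student outputs.
    l_kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return gamma * l_cls + alpha * l_kl


def hint_loss(student_logits, labels, student_feat, teacher_feat,
              gamma=1.0, beta=1.0):
    """Feature-based (hint) distillation loss: gamma * L_CLS + beta * L_Hint.

    student_feat / teacher_feat are intermediate activations from a chosen
    layer; the layer index is another detail left to the paper's appendix.
    Assumes the two feature maps already have matching shapes (otherwise a
    small adapter layer is typically added on the student side).
    """
    l_cls = F.cross_entropy(student_logits, labels)
    l_hint = F.mse_loss(student_feat, teacher_feat)
    return gamma * l_cls + beta * l_hint
```

Scaling the KL term by the squared temperature is the usual Hinton-style correction that keeps gradient magnitudes comparable across temperatures; whether the paper applies this scaling is not stated in the excerpts above.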