Bayesian Knowledge Distillation: A Bayesian Perspective of Distillation with Uncertainty Quantification
Authors: Luyang Fang, Yongkai Chen, Wenxuan Zhong, Ping Ma
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed BKD on both synthetic and real benchmark datasets. We also evaluate BKD on some synthetic datasets, presented in Appendix C. The empirical performance of BKD is demonstrated on both synthetic and real datasets. |
| Researcher Affiliation | Academia | 1Department of Statistics, University of Georgia, Athens, USA. Correspondence to: Wenxuan Zhong <wenxuan@uga.edu>, Ping Ma <pingma@uga.edu>. |
| Pseudocode | Yes | Algorithm 1 Bayesian Knowledge Distillation (BKD). Input: $D = \{(x_i, y_i)\}_{i=1}^{N}$, $h(\cdot, \cdot)$, $\tau$, $\lambda$, $r$. 1: Get the output $p$ of the teacher model for each data point in $D$. 2: Calculate the posterior distribution of $q = h(x, \theta)$. 3: Generate a Monte Carlo sample of $\theta$: at the $j$-th iteration, with a subset of $m$ data points $D^{(j)} = \{(x_i^{(j)}, y_i^{(j)})\}_{i=1}^{m}$, generate $\xi^{(j)} \sim N(0, I)$ and generate $\theta^{(j)}$ using SGLD as in Equation (9). Output: Monte Carlo sample $\{\theta^{(j)}\}_{j=1}^{r}$ of $\theta$. (A hedged code sketch of this algorithm appears after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the described methodology. |
| Open Datasets | Yes | We test the proposed BKD method on four benchmark datasets, (1) MNIST, (2) Fashion MNIST, (3) CIFAR-10, and (4) CIFAR-100. Detailed information about the datasets can be found in Appendix D.1. (MNIST (LeCun, 1998) is a dataset of handwritten digit images with a training set of 60,000 examples and a test set of 10,000 examples.) |
| Dataset Splits | Yes | We consider four different scenarios for generating synthetic data, dividing the data into training, validation, and testing sets in a 7:3:1 ratio for all scenarios. (A loading and splitting sketch appears after the table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions the use of the 'PyTorch torchvision library' but does not provide specific version numbers for this or any other software dependency. |
| Experiment Setup | Yes | Algorithm 1 Bayesian Knowledge Distillation (BKD). Input: $D = \{(x_i, y_i)\}_{i=1}^{N}$, $h(\cdot, \cdot)$, $\tau$, $\lambda$, $r$. (Appendix D.2, MNIST dataset): Specifically, the teacher model employs an MLP with two hidden layers of 1200 nodes each, using the ReLU activation function, a dropout rate of 0.5, and a dropout layer with rate 0.2 on the input. The student model employs an MLP with two hidden layers of 200 and 100 nodes, respectively, also using the ReLU activation function. (An architecture sketch appears after the table.) |
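
The Pseudocode row quotes Algorithm 1, which draws a Monte Carlo sample of the student parameters $\theta$ with stochastic gradient Langevin dynamics (SGLD). The sketch below is a minimal Python/PyTorch reading of that loop, not the paper's implementation: Equation (9) is not reproduced in the quote, so a standard SGLD update is used, the distillation cross-entropy against the teacher's temperature-softened outputs stands in for the negative log-likelihood, and a Gaussian prior with precision $\lambda$ stands in for the prior term. The `step_size` argument and all default values are assumptions.

```python
# Hypothetical sketch of Algorithm 1 (BKD). The posterior and step size
# schedule of the paper's Equation (9) are not quoted, so a standard SGLD
# update over a KD cross-entropy "likelihood" plus a Gaussian prior is used.
import torch
import torch.nn.functional as F

def bkd_sgld(student, teacher, loader, tau=4.0, lam=1e-4, r=100,
             step_size=1e-5, device="cpu"):
    """Draw r Monte Carlo samples of the student parameters theta via SGLD.

    tau: distillation temperature, lam: prior precision, r: number of
    retained samples. Defaults are illustrative, not the paper's settings.
    """
    student.to(device)
    teacher.to(device).eval()
    samples = []
    data_iter = iter(loader)
    n_data = len(loader.dataset)
    for j in range(r):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, _ = next(data_iter)
        x = x.to(device)
        with torch.no_grad():
            p = F.softmax(teacher(x) / tau, dim=1)      # teacher soft labels (Step 1)
        log_q = F.log_softmax(student(x) / tau, dim=1)   # student soft predictions
        # Minibatch estimate of the negative log-posterior: KD cross-entropy
        # rescaled to the full dataset, plus a Gaussian prior term.
        nll = -(p * log_q).sum(dim=1).mean() * n_data
        prior = 0.5 * lam * sum((w ** 2).sum() for w in student.parameters())
        loss = nll + prior
        student.zero_grad()
        loss.backward()
        with torch.no_grad():
            for w in student.parameters():
                xi = torch.randn_like(w)                 # xi^(j) ~ N(0, I)
                # SGLD step: gradient step on the negative log-posterior
                # plus injected Gaussian noise.
                w -= 0.5 * step_size * w.grad
                w += (step_size ** 0.5) * xi
        samples.append({k: v.detach().clone() for k, v in student.state_dict().items()})
    return samples
```

The returned parameter samples can then be averaged at prediction time to produce both point predictions and uncertainty estimates, which is the stated purpose of the Monte Carlo sample.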
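The Open Datasets and Dataset Splits rows mention the torchvision benchmark datasets and a 7:3:1 train/validation/test split for the synthetic scenarios. A minimal sketch of how such data could be prepared is below; the synthetic generator, the transform, and the seed are placeholders, since the paper's exact scenarios (Appendix C and D.1) are not quoted here.

```python
# Illustrative data preparation; not the paper's exact preprocessing.
import torch
from torch.utils.data import TensorDataset, random_split
from torchvision import datasets, transforms

# Benchmark data: MNIST via torchvision, with its standard 60,000/10,000 split.
mnist_train = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
mnist_test = datasets.MNIST("data", train=False, download=True,
                            transform=transforms.ToTensor())

# Synthetic data: a placeholder dataset split 7:3:1 into train/validation/test.
X, y = torch.randn(11000, 20), torch.randint(0, 2, (11000,))
synthetic = TensorDataset(X, y)
n = len(synthetic)
n_train, n_val = round(n * 7 / 11), round(n * 3 / 11)
n_test = n - n_train - n_val
train, val, test = random_split(synthetic, [n_train, n_val, n_test],
                                generator=torch.Generator().manual_seed(0))
```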
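The Experiment Setup row quotes the MNIST teacher and student architectures from Appendix D.2. A sketch of those two MLPs in PyTorch follows; the exact placement of the dropout layers and the 784-input/10-class dimensions for MNIST reflect our reading of the description rather than the authors' code.

```python
# Teacher/student MLPs for MNIST following the sizes quoted from Appendix D.2.
import torch.nn as nn

teacher = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(0.2),                                   # input dropout, rate 0.2
    nn.Linear(784, 1200), nn.ReLU(), nn.Dropout(0.5),  # hidden layer 1, 1200 nodes
    nn.Linear(1200, 1200), nn.ReLU(), nn.Dropout(0.5), # hidden layer 2, 1200 nodes
    nn.Linear(1200, 10),
)

student = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 200), nn.ReLU(),   # hidden layer 1, 200 nodes
    nn.Linear(200, 100), nn.ReLU(),   # hidden layer 2, 100 nodes
    nn.Linear(100, 10),
)
```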