Bayesian Knowledge Distillation: A Bayesian Perspective of Distillation with Uncertainty Quantification

Authors: Luyang Fang, Yongkai Chen, Wenxuan Zhong, Ping Ma

ICML 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate the proposed BKD on both synthetic and real benchmark datasets. We also evaluate BKD on some synthetic datasets, presented in Appendix C. The empirical performance of BKD is demonstrated on both synthetic and real datasets. |
| Researcher Affiliation | Academia | Department of Statistics, University of Georgia, Athens, USA. Correspondence to: Wenxuan Zhong <wenxuan@uga.edu>, Ping Ma <pingma@uga.edu>. |
| Pseudocode | Yes | Algorithm 1 Bayesian Knowledge Distillation (BKD). Input: D = {(x_i, y_i)}_{i=1}^N, h(·,·), τ, λ, r. 1: Get the output p of the teacher model for each data point in D. 2: Calculate the posterior distribution of q = h(x, θ). 3: Generate a Monte Carlo sample of θ: at iteration j, with a subset of m data points D^{(j)} = {(x_i^{(j)}, y_i^{(j)})}_{i=1}^m, generate ξ^{(j)} ~ N(0, I) and generate θ^{(j)} using SGLD as in Equation (9). Output: Monte Carlo sample {θ^{(j)}}_{j=1}^r of θ. (A hedged code sketch of this sampling loop is given after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of the described methodology. |
| Open Datasets | Yes | We test the proposed BKD method on four benchmark datasets: (1) MNIST, (2) Fashion MNIST, (3) CIFAR-10, and (4) CIFAR-100. Detailed information about the datasets can be found in Appendix D.1. (MNIST (Le Cun, 1998) is a dataset of handwritten digit images with a training set of 60,000 examples and a test set of 10,000 examples.) |
| Dataset Splits | Yes | We consider four different scenarios for generating synthetic data, dividing the data into training, validation, and testing sets in a 7:3:1 ratio for all scenarios. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments. |
| Software Dependencies | No | The paper mentions the use of the PyTorch torchvision library but does not provide specific version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | Algorithm 1 Bayesian Knowledge Distillation (BKD). Input: D = {(x_i, y_i)}_{i=1}^N, h(·,·), τ, λ, r. (Appendix D.2, MNIST dataset): Specifically, the teacher model employs an MLP with two hidden layers of 1200 hidden nodes. The model uses the ReLU activation function and incorporates a dropout rate of 0.5. The model also incorporates a dropout layer with rate 0.2 for the input. The student model employs an MLP architecture consisting of two hidden layers. These layers have 200 and 100 nodes, respectively. The model uses the ReLU activation function. (A hedged PyTorch sketch of these architectures is given after the table.) |
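
The Algorithm 1 pseudocode quoted in the Pseudocode row describes posterior sampling for the student parameters θ via stochastic gradient Langevin dynamics (SGLD). The paper's Equation (9) and exact BKD posterior are not reproduced here, so the sketch below uses the generic SGLD update with a distillation-style likelihood; the function name, the assumption that the loader yields `(x, p)` pairs with precomputed teacher soft labels `p`, and the roles of `tau`, `lam`, and `step_size` are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of Algorithm 1 (BKD sampling via SGLD); not the authors' code.
import torch
import torch.nn.functional as F


def sgld_sample_bkd(student, loader, n_total, tau=4.0, lam=1e-4,
                    step_size=1e-5, num_samples=100):
    """Draw `num_samples` Monte Carlo samples of the student parameters theta.

    `loader` is assumed to yield minibatches (x, p), where p holds the
    teacher's soft predictions for x, precomputed once as in step 1 of
    Algorithm 1. `lam` acts as a Gaussian-prior precision and `tau` as the
    distillation temperature; both are illustrative assumptions.
    """
    samples = []
    data_iter = iter(loader)
    for j in range(num_samples):
        try:
            x, p = next(data_iter)            # minibatch D^(j) of size m
        except StopIteration:
            data_iter = iter(loader)
            x, p = next(data_iter)

        student.zero_grad()
        log_q = F.log_softmax(student(x) / tau, dim=1)
        # Distillation-style log-likelihood: match the teacher's soft labels p.
        log_lik = (p * log_q).sum()
        # Gaussian prior on theta with precision lam.
        log_prior = -0.5 * lam * sum((w ** 2).sum() for w in student.parameters())
        # Minibatch estimate of the log-posterior (rescale the likelihood term).
        log_post = (n_total / x.size(0)) * log_lik + log_prior
        log_post.backward()

        with torch.no_grad():
            for w in student.parameters():
                xi = torch.randn_like(w)      # xi^(j) ~ N(0, I)
                # SGLD step: gradient ascent on the log-posterior plus Gaussian noise.
                w += 0.5 * step_size * w.grad + (step_size ** 0.5) * xi

        samples.append([w.detach().clone() for w in student.parameters()])
    return samples
```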
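The four benchmark datasets named in the Open Datasets row are all distributed through the torchvision library that the paper cites as a dependency. A minimal loading sketch, assuming a local `./data` root and a plain `ToTensor` transform, is:

```python
# Minimal sketch of loading the four benchmark datasets with torchvision.
# The ./data root and the plain ToTensor() transform are assumptions.
from torchvision import datasets, transforms

transform = transforms.ToTensor()

mnist_train = datasets.MNIST("./data", train=True, download=True, transform=transform)
mnist_test = datasets.MNIST("./data", train=False, download=True, transform=transform)
fashion_train = datasets.FashionMNIST("./data", train=True, download=True, transform=transform)
cifar10_train = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
cifar100_train = datasets.CIFAR100("./data", train=True, download=True, transform=transform)
```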
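The Dataset Splits row quotes a 7:3:1 train/validation/test ratio for the synthetic data. One way to reproduce such a split, assuming the data is wrapped in a torch `Dataset` and using an arbitrary seed, is `torch.utils.data.random_split`:

```python
# Hedged sketch of a 7:3:1 train/validation/test split with torch.utils.data.
import torch
from torch.utils.data import random_split


def split_7_3_1(dataset, seed=0):
    n = len(dataset)
    n_train = round(n * 7 / 11)   # 7, 3, and 1 parts of the whole
    n_val = round(n * 3 / 11)
    n_test = n - n_train - n_val  # remainder goes to the test set
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```

With the MNIST object from the previous sketch, `train_set, val_set, test_set = split_7_3_1(mnist_train)` would produce the three subsets.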
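The MNIST architectures quoted in the Experiment Setup row translate directly into PyTorch. The sketch below assumes 784-dimensional flattened inputs and 10 output classes, which the quoted text does not state explicitly.

```python
# Hedged sketch of the MNIST teacher/student MLPs described in Appendix D.2.
# Input size 784 (flattened 28x28 images) and 10 output classes are assumptions.
import torch.nn as nn

teacher = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(0.2),        # dropout on the input, rate 0.2
    nn.Linear(784, 1200),
    nn.ReLU(),
    nn.Dropout(0.5),        # dropout on hidden layers, rate 0.5
    nn.Linear(1200, 1200),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(1200, 10),
)

student = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 200),
    nn.ReLU(),
    nn.Linear(200, 100),
    nn.ReLU(),
    nn.Linear(100, 10),
)
```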