Improving Ensemble Distillation With Weight Averaging and Diversifying Perturbation

Authors: Giung Nam, Hyungi Lee, Byeongho Heo, Juho Lee

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present the experimental results on image classification benchmarks including CIFAR-10, CIFAR-100 (Krizhevsky, 2009), Tiny ImageNet, and ImageNet-1k (Russakovsky et al., 2015). Through the experiments, we empirically validate the following questions: What does the subspace discovered by LatentBE look like? (Section 5.1) How does the proposed perturbation strategy affect the training of LatentBE? (Section 5.2) Does our ensemble distillation algorithm improve performance both in terms of predictive accuracy and uncertainty calibration? (Section 5.3) Please refer to Appendix B for the training details, including data augmentation, learning rate schedules, and other hyperparameter settings.
Researcher Affiliation | Collaboration | ¹Korea Advanced Institute of Science and Technology, Daejeon, Korea; ²Naver, Korea; ³AITRICS, Seoul, South Korea.
Pseudocode | Yes | Algorithm 1: Ensemble distillation with BE ... Algorithm 2: Ensemble distillation with LatentBE + diversifying perturbation
Open Source Code | Yes | Code is available at https://github.com/cs-giung/distill-latentbe.
Open Datasets | Yes | CIFAR-10/100: The dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html. ... Tiny ImageNet: The dataset is available at http://cs231n.stanford.edu/tiny-imagenet-200.zip. ... ImageNet-1k: It consists of 1,281,167 train examples, 50,000 validation examples, and 100,000 test images from 1,000 classes.
Dataset Splits | Yes | It consists of 50,000 train examples and 10,000 test examples from 10/100 classes, with an image size of 32×32×3. In this paper, the last 5,000 examples of the train split are used as the validation split for computing calibrated metrics. ... Consequently, the last 50 examples for each class of the train split are used as the validation split for computing calibrated metrics, i.e., the train and validation splits consist of 90,000 and 10,000 examples, respectively.
Hardware Specification | Yes | More specifically, Table 5 reports the runtimes of BE and LatentBE on the same single GeForce RTX 3090 setting. ... Besides, the experiments on ImageNet-1k are conducted with 8 TPUv3 cores, supported by the TPU Research Cloud.
Software Dependencies | Yes | Our implementation for the experiments on CIFAR-10/100 and Tiny ImageNet is built on PyTorch (Paszke et al., 2019).
Experiment Setup | Yes | All images are standardized by subtracting the per-channel mean and dividing the result by the per-channel standard deviation. We use the SGD optimizer with Nesterov momentum 0.9 and a single-cycle cosine annealing learning rate schedule with a linear warm-up, i.e., the learning rate starts from 0.01 × base_lr, reaches base_lr after the first 5 epochs, and is then decayed by the single-cycle cosine annealing schedule. More precisely, (1) for CIFAR-10/100, we run 200 epochs on a single machine with batch size 128 and base_lr = 0.1; (2) for Tiny ImageNet, we run 80 epochs on four machines with a total batch size of 128 and base_lr = 0.1; and (3) for ImageNet-1k, we run 100 epochs on eight machines with a total batch size of 256 and base_lr = 0.1. We also apply weight decay (Krogh & Hertz, 1991) to regularize training; the weight decay coefficient is set to 0.0005 for CIFAR-10/100 and Tiny ImageNet, and 0.0001 for ImageNet-1k.
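The Pseudocode row above refers to the paper's Algorithm 1 (ensemble distillation with BE) and Algorithm 2 (ensemble distillation with LatentBE + diversifying perturbation), which are not reproduced in this report. For orientation only, below is a minimal sketch of the generic ensemble-distillation objective, i.e., training a student to match the teachers' averaged softmax via KL divergence. It is not the paper's algorithm: the BatchEnsemble weight structure, weight averaging, and diversifying perturbations are omitted, and names such as `teacher_logit_list` and `temperature` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logit_list, temperature=1.0):
    # Ensemble prediction: average of the teachers' (temperature-scaled) softmax outputs.
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t / temperature, dim=-1) for t in teacher_logit_list]
        ).mean(dim=0)
    # KL(ensemble || student); the T^2 factor is the usual distillation scaling.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Example: a batch of 8 examples, 10 classes, 4 teacher networks.
student_logits = torch.randn(8, 10)
teacher_logit_list = [torch.randn(8, 10) for _ in range(4)]
loss = ensemble_distillation_loss(student_logits, teacher_logit_list)
```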
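The Dataset Splits row reports that the last 5,000 examples of the CIFAR train split are held out as a validation set for computing calibrated metrics. A minimal sketch of that split, assuming a standard torchvision CIFAR-10 loader; the transform is a placeholder and does not reproduce the paper's standardization or augmentation.

```python
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Placeholder transform; per-channel standardization (Normalize) would be composed here.
transform = transforms.ToTensor()

full_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_set = Subset(full_train, range(0, 45_000))       # first 45,000 examples for training
valid_set = Subset(full_train, range(45_000, 50_000))  # last 5,000 examples for calibrated metrics
```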
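The Experiment Setup row describes SGD with Nesterov momentum 0.9, a 5-epoch linear warm-up starting from 0.01 × base_lr, and single-cycle cosine annealing afterwards. A minimal PyTorch sketch under the reported CIFAR-10/100 settings; the placeholder model, the per-epoch scheduler stepping, and the exact warm-up interpolation are assumptions rather than details taken from the paper's code.

```python
import math
import torch

base_lr, warmup_epochs, total_epochs = 0.1, 5, 200  # CIFAR-10/100 settings reported above

model = torch.nn.Linear(3 * 32 * 32, 10)  # placeholder standing in for the actual network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=base_lr,
    momentum=0.9,
    nesterov=True,
    weight_decay=5e-4,  # 1e-4 for ImageNet-1k
)

def lr_factor(epoch):
    # Linear warm-up from 0.01 * base_lr to base_lr over the first `warmup_epochs` epochs.
    if epoch < warmup_epochs:
        return 0.01 + (1.0 - 0.01) * epoch / warmup_epochs
    # Single-cycle cosine annealing over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# A training loop would call optimizer.step() per batch and scheduler.step() once per epoch.
```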