Progressive Ensemble Distillation: Building Ensembles for Efficient Inference

Authors: Don Dennis, Abhishek Shetty, Anish Prasad Sevekari, Kazuhito Koishida, Virginia Smith

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We empirically evaluate our algorithm on synthetic and real-world classification tasks from computer vision, speech, and sensor processing with models suitable for the respective domains.
Researcher Affiliation | Collaboration | Don Kurian Dennis (Carnegie Mellon University); Abhishek Shetty (University of California, Berkeley); Anish Sevekari (Carnegie Mellon University); Kazuhito Koishida (Microsoft); Virginia Smith (Carnegie Mellon University)
Pseudocode | Yes | Algorithm 1 B-DISTIL: Main algorithm; Algorithm 2 FIND-WL
Open Source Code | Yes | Our code can be found at: github.com/metastableB/bdistil.
Open Datasets | Yes | Our image classification experiments use the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet datasets. For time-series classification tasks we use the Google-13 speech commands dataset. Finally, we use the daily sports activities (DSA) dataset for experiments with sensor data.
Dataset Splits | Yes | Except for the pretrained ResNet models, all other teacher models are selected based on performance on validation data. The reported dataset statistics are listed below (a CIFAR loading sketch follows after this table):
  Dataset | Train samples | Test/Val samples | Num. labels | Source
  CIFAR-10 | 50000 | 10000 | 10 | [29]
  CIFAR-100 | 50000 | 10000 | 100 | [29]
  DSA-19 | 6800 | 2280 | 19 | [14]
  Google-13 | 52886 | 6835 | 13 | [43]
  ImageNet-1k | 1281167 | 50000 | 1000 | [37]
  Tiny ImageNet-200 | 100000 | 10000 | 200 | [30]
Hardware Specification | Yes | For simplicity of presentation, we convert these to the corresponding inference times (τ) on a reference accelerator (NVIDIA 3090 Ti).
Software Dependencies | No | The paper mentions using 'PyTorch' and the 'torch.autograd.profiler' module, but does not specify exact version numbers for these or any other key software dependencies required for reproducibility. (A profiling sketch based on this module follows after this table.)
Experiment Setup | Yes | For experiments on CIFAR-100 and CIFAR-10, we use a learning rate of 0.1, a momentum parameter of 0.9, and a weight decay of 5e-4. We train for 200 epochs and reduce the learning rate by a factor of 0.2 after 30%, 60%, and 90% of the epochs. We perform 4-GPU data-parallel training for ImageNet with a per-GPU batch size of 256, a learning rate of 0.1, a momentum of 0.9, a regularization γ of 1.0, and a weight decay of 1e-4. We train for 90 epochs and discount the learning rate by a factor of 0.1 at 30% and 60% of the epochs. For experiments with time-series data, Google-13 and DSA-19, we use a fixed learning rate of 0.05 and a momentum of 0.9. We do not use weight decay or learning rate scheduling for time-series data. (An optimizer/schedule sketch for the CIFAR settings follows after this table.)
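
The CIFAR counts in the Dataset Splits row match the standard torchvision train/test splits. The snippet below is a minimal loading sketch under that assumption; the data root and transform are placeholders rather than details taken from the authors' code.

import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # placeholder transform, not the authors' preprocessing

# Standard torchvision splits match the reported counts:
# CIFAR-10 and CIFAR-100 each have 50000 train and 10000 test images.
cifar10_train = torchvision.datasets.CIFAR10("data/", train=True, download=True, transform=transform)
cifar10_test = torchvision.datasets.CIFAR10("data/", train=False, download=True, transform=transform)
cifar100_train = torchvision.datasets.CIFAR100("data/", train=True, download=True, transform=transform)
cifar100_test = torchvision.datasets.CIFAR100("data/", train=False, download=True, transform=transform)

print(len(cifar10_train), len(cifar10_test))    # 50000 10000
print(len(cifar100_train), len(cifar100_test))  # 50000 10000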
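
The Hardware Specification and Software Dependencies rows indicate that inference times were obtained with PyTorch's torch.autograd.profiler module on an NVIDIA 3090 Ti. The snippet below is a hedged sketch of such a measurement; the model, input shape, warm-up, and iteration counts are illustrative assumptions rather than the authors' protocol, and use_cuda is the legacy profiler flag (newer PyTorch versions route this through torch.profiler).

import torch
import torchvision

model = torchvision.models.resnet18().eval().cuda()   # placeholder model, not a B-DISTIL student
x = torch.randn(1, 3, 224, 224, device="cuda")        # placeholder input shape

with torch.no_grad():
    # Warm-up iterations so kernel launches and caching do not skew timings.
    for _ in range(10):
        model(x)
    # Profile CPU and CUDA time over repeated forward passes.
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for _ in range(100):
            model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))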
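
The Experiment Setup row reports SGD-style CIFAR hyperparameters: learning rate 0.1, momentum 0.9, weight decay 5e-4, 200 epochs, and a 0.2x learning-rate reduction after 30%, 60%, and 90% of the epochs. The snippet below sketches that configuration in PyTorch; the student model and the commented training loop are placeholders, not the authors' implementation.

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=100)  # placeholder student model
criterion = torch.nn.CrossEntropyLoss()

epochs = 200
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # reported learning rate
    momentum=0.9,       # reported momentum
    weight_decay=5e-4,  # reported weight decay
)
# Multiply the learning rate by 0.2 after 30%, 60%, and 90% of the epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[int(0.3 * epochs), int(0.6 * epochs), int(0.9 * epochs)],
    gamma=0.2,
)

for epoch in range(epochs):
    # train_loader is assumed to yield (inputs, labels) CIFAR batches:
    # for inputs, labels in train_loader:
    #     optimizer.zero_grad()
    #     loss = criterion(model(inputs), labels)
    #     loss.backward()
    #     optimizer.step()
    scheduler.step()

The reported ImageNet schedule follows the same pattern with gamma=0.1, milestones at 30% and 60% of 90 epochs, and a weight decay of 1e-4.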