Progressive Ensemble Distillation: Building Ensembles for Efficient Inference

Authors: Don Dennis, Abhishek Shetty, Anish Prasad Sevekari, Kazuhito Koishida, Virginia Smith

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We empirically evaluate our algorithm on synthetic and real-world classification tasks from computer vision, speech, and sensor processing with models suitable for the respective domains.
Researcher Affiliation | Collaboration | Don Kurian Dennis (Carnegie Mellon University); Abhishek Shetty (University of California, Berkeley); Anish Sevekari (Carnegie Mellon University); Kazuhito Koishida (Microsoft); Virginia Smith (Carnegie Mellon University)
Pseudocode | Yes | Algorithm 1 B-DISTIL: Main algorithm; Algorithm 2 FIND-WL
Open Source Code | Yes | Our code can be found at: github.com/metastableB/bdistil.
Open Datasets | Yes | Our image classification experiments use the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet datasets. For time-series classification tasks we use the Google-13 speech commands dataset. Finally, we use the daily sports activities (DSA) dataset for experiments with sensor data.
Dataset Splits | Yes | Except for the pretrained ResNet models, all other teacher models are selected based on performance on validation data. The reported dataset statistics are listed below (a CIFAR loading sketch follows after this table):
  Dataset | Train samples | Test/Val samples | Num. labels | Source
  CIFAR-10 | 50000 | 10000 | 10 | [29]
  CIFAR-100 | 50000 | 10000 | 100 | [29]
  DSA-19 | 6800 | 2280 | 19 | [14]
  Google-13 | 52886 | 6835 | 13 | [43]
  ImageNet-1k | 1281167 | 50000 | 1000 | [37]
  Tiny ImageNet-200 | 100000 | 10000 | 200 | [30]
Hardware Specification | Yes | For simplicity of presentation, we convert these to the corresponding inference times (τ) on a reference accelerator (NVIDIA 3090 Ti).
Software Dependencies | No | The paper mentions using 'PyTorch' and the 'torch.autograd.profiler' module, but does not specify exact version numbers for these or any other key software dependencies required for reproducibility. (A profiling sketch based on this module follows after this table.)
Experiment Setup | Yes | For experiments on CIFAR-100 and CIFAR-10, we use a learning rate of 0.1, a momentum parameter of 0.9, and a weight decay of 5e-4. We train for 200 epochs and reduce the learning rate by a factor of 0.2 after 30%, 60%, and 90% of the epochs. We perform 4-GPU data-parallel training for ImageNet with a per-GPU batch size of 256, a learning rate of 0.1, a momentum of 0.9, a regularization γ of 1.0, and a weight decay of 1e-4. We train for 90 epochs and discount the learning rate by a factor of 0.1 at 30% and 60% of the epochs. For experiments with time-series data, Google-13 and DSA-19, we use a fixed learning rate of 0.05 and a momentum of 0.9. We do not use weight decay or learning rate scheduling for time-series data. (An optimizer/schedule sketch for the CIFAR settings follows after this table.)
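
The CIFAR counts in the Dataset Splits row match the standard torchvision train/test splits. The snippet below is a minimal loading sketch under that assumption; the data root and transform are placeholders rather than details taken from the authors' code.

import torchvision
import torchvision.transforms as T

transform = T.ToTensor()  # placeholder transform, not the authors' preprocessing

# Standard torchvision splits match the reported counts:
# CIFAR-10 and CIFAR-100 each have 50000 train and 10000 test images.
cifar10_train = torchvision.datasets.CIFAR10("data/", train=True, download=True, transform=transform)
cifar10_test = torchvision.datasets.CIFAR10("data/", train=False, download=True, transform=transform)
cifar100_train = torchvision.datasets.CIFAR100("data/", train=True, download=True, transform=transform)
cifar100_test = torchvision.datasets.CIFAR100("data/", train=False, download=True, transform=transform)

print(len(cifar10_train), len(cifar10_test))    # 50000 10000
print(len(cifar100_train), len(cifar100_test))  # 50000 10000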
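
The Hardware Specification and Software Dependencies rows indicate that inference times were obtained with PyTorch's torch.autograd.profiler module on an NVIDIA 3090 Ti. The snippet below is a hedged sketch of such a measurement; the model, input shape, warm-up, and iteration counts are illustrative assumptions rather than the authors' protocol, and use_cuda is the legacy profiler flag (newer PyTorch versions route this through torch.profiler).

import torch
import torchvision

model = torchvision.models.resnet18().eval().cuda()   # placeholder model, not a B-DISTIL student
x = torch.randn(1, 3, 224, 224, device="cuda")        # placeholder input shape

with torch.no_grad():
    # Warm-up iterations so kernel launches and caching do not skew timings.
    for _ in range(10):
        model(x)
    # Profile CPU and CUDA time over repeated forward passes.
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        for _ in range(100):
            model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))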
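
The Experiment Setup row reports SGD-style CIFAR hyperparameters: learning rate 0.1, momentum 0.9, weight decay 5e-4, 200 epochs, and a 0.2x learning-rate reduction after 30%, 60%, and 90% of the epochs. The snippet below sketches that configuration in PyTorch; the student model and the commented training loop are placeholders, not the authors' implementation.

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=100)  # placeholder student model
criterion = torch.nn.CrossEntropyLoss()

epochs = 200
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # reported learning rate
    momentum=0.9,       # reported momentum
    weight_decay=5e-4,  # reported weight decay
)
# Multiply the learning rate by 0.2 after 30%, 60%, and 90% of the epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[int(0.3 * epochs), int(0.6 * epochs), int(0.9 * epochs)],
    gamma=0.2,
)

for epoch in range(epochs):
    # train_loader is assumed to yield (inputs, labels) CIFAR batches:
    # for inputs, labels in train_loader:
    #     optimizer.zero_grad()
    #     loss = criterion(model(inputs), labels)
    #     loss.backward()
    #     optimizer.step()
    scheduler.step()

The reported ImageNet schedule follows the same pattern with gamma=0.1, milestones at 30% and 60% of 90 epochs, and a weight decay of 1e-4.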