Progressive Ensemble Distillation: Building Ensembles for Efficient Inference
Authors: Don Dennis, Abhishek Shetty, Anish Prasad Sevekari, Kazuhito Koishida, Virginia Smith
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of B-DISTIL by decomposing pretrained models across standard image, speech, and sensor datasets. We empirically evaluate our algorithm on synthetic and real-world classification tasks from computer vision, speech, and sensor processing with models suitable for the respective domains. |
| Researcher Affiliation | Collaboration | Don Kurian Dennis (Carnegie Mellon University); Abhishek Shetty (University of California, Berkeley); Anish Sevekari (Carnegie Mellon University); Kazuhito Koishida (Microsoft); Virginia Smith (Carnegie Mellon University) |
| Pseudocode | Yes | Algorithm 1 B-DISTIL: Main algorithm; Algorithm 2 FIND-WL |
| Open Source Code | Yes | Our code can be found at: github.com/metastableB/bdistil. |
| Open Datasets | Yes | Our image classification experiments use the CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet datasets. For time-series classification tasks we use the Google-13 speech commands dataset. Finally, we use the daily sports activities (DSA) dataset for experiments with sensor data. |
| Dataset Splits | Yes | Except for the pretrained ResNet models, all other teacher models are selected based on performance on validation data. Dataset (train samples / test-val samples / num. labels, source): CIFAR-10 (50000 / 10000 / 10, [29]); CIFAR-100 (50000 / 10000 / 100, [29]); DSA-19 (6800 / 2280 / 19, [14]); Google-13 (52886 / 6835 / 13, [43]); ImageNet-1k (1281167 / 50000 / 1000, [37]); Tiny ImageNet-200 (100000 / 10000 / 200, [30]) |
| Hardware Specification | Yes | For simplicity of presentation, we convert these to the corresponding inference times (τ) on a reference accelerator (NVIDIA 3090Ti). |
| Software Dependencies | No | The paper mentions using 'PyTorch' and the 'torch.autograd.profiler' module, but does not specify exact version numbers for these or any other key software dependencies required for reproducibility (a hedged timing sketch using this profiler follows the table). |
| Experiment Setup | Yes | For experiments on CIFAR-100 and CIFAR-10, we use a learning rate of 0.1, a momentum parameter of 0.9, and a weight decay of 5e-4. We train for 200 epochs and reduce the learning rate by a factor of 0.2 after 30%, 60%, and 90% of the epochs. We perform 4-GPU data-parallel training for ImageNet with a per-GPU batch size of 256, learning rate 0.1, momentum 0.9, regularization γ of 1.0, and a weight decay of 1e-4. We train for 90 epochs and discount the learning rate by a factor of 0.1 at 30% and 60% of the epochs. For experiments with time-series data, Google-13 and DSA-19, we use a fixed learning rate of 0.05 and a momentum of 0.9. We do not use weight decay or learning rate scheduling for time-series data. (A hedged optimizer/scheduler sketch of these settings follows the table.) |
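
The experiment-setup row above reports the CIFAR-10/100 hyperparameters (SGD, lr 0.1, momentum 0.9, weight decay 5e-4, 200 epochs, step decay of 0.2 at 30%/60%/90%). The following is a minimal sketch of how those settings map onto a PyTorch optimizer and scheduler; the placeholder model and the empty training loop are assumptions for illustration, not the authors' implementation.

```python
import torch

# Placeholder model; the paper's actual architectures are not reproduced here.
model = torch.nn.Linear(3 * 32 * 32, 100)

# SGD with lr=0.1, momentum=0.9, weight decay=5e-4, as quoted in the table.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

# Reduce the learning rate by a factor of 0.2 at 30%, 60%, and 90% of 200 epochs.
epochs = 200
milestones = [int(0.3 * epochs), int(0.6 * epochs), int(0.9 * epochs)]  # [60, 120, 180]
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.2)

for epoch in range(epochs):
    # ... forward/backward passes over the training set would go here ...
    scheduler.step()
```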
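
The hardware and software rows mention converting model costs to inference times (τ) on an NVIDIA 3090Ti using the torch.autograd.profiler module. Below is a hedged sketch of that kind of measurement with the legacy autograd profiler; the model, input shape, and batch size are illustrative assumptions rather than the configurations evaluated in the paper.

```python
import torch
from torch.autograd import profiler

# Illustrative model and input; not the models or datasets from the paper.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
).to(device).eval()
batch = torch.randn(32, 128, device=device)

# Profile a single forward pass; CUDA timings are recorded when a GPU is available.
with torch.no_grad():
    with profiler.profile(use_cuda=(device == "cuda")) as prof:
        model(batch)

# Aggregate per-operator timings; the total time stands in for the inference time τ.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key))
```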