Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets
Authors: Max Ryabinin, Andrey Malinin, Mark Gales
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate EnD² via minimization of the reverse KL-divergence to the Proxy-Dirichlet target, which we refer to as PD-EnD². We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT'17 En-De, which feature 1000, 5000 and 40,000 classes, respectively. (A hedged sketch of this loss is given after the table.) |
| Researcher Affiliation | Collaboration | Max Ryabinin (Yandex; HSE University, Moscow, Russia; mryabinin0@gmail.com), Andrey Malinin (Yandex; HSE University, Moscow, Russia; am969@yandex-team.ru), Mark Gales (Cambridge ALTA Institute, Cambridge, United Kingdom; mjfg@eng.cam.ac.uk) |
| Pseudocode | No | The paper mentions algorithms and methods but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and training configurations are available at https://github.com/yandex-research/proxy-dirichlet-distillation. |
| Open Datasets | Yes | We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT'17 En-De, which feature 1000, 5000 and 40,000 classes, respectively. |
| Dataset Splits | Yes | To evaluate the predictive performance of the proposed method, we measure the classification accuracy and Expected Calibration Error (ECE) on the original ImageNet validation subset [21] and on a range of distributionally shifted datasets. |
| Hardware Specification | Yes | In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software components such as the 'Albumentations library', the 'AdaDelta algorithm', and the 'Adam optimizer', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Specifically, we train for 90 epochs using stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.1 × B/256, where B is the per-device batch size multiplied by the number of GPUs. In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs. The learning rate is divided by 10 every 30 epochs. (A hedged schedule sketch is given after the table.) |
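
The Research Type row quotes the core training objective: distilling an ensemble into a single Dirichlet-parameterizing network by minimizing a reverse KL-divergence to a Proxy-Dirichlet target. The sketch below is a minimal reconstruction under stated assumptions, not the authors' released implementation: it assumes the ensemble supplies per-member class probabilities, uses a moment-style precision estimate for the proxy target, takes the reverse KL as KL(student ‖ target), and shifts all concentration parameters by 1. The names `proxy_dirichlet_target` and `reverse_kl_loss` are illustrative.

```python
# Hedged sketch of reverse-KL Proxy-Dirichlet distillation (illustrative, not the
# authors' code). Shapes: ensemble_probs is [members, batch, classes].
import torch
from torch.distributions import Dirichlet, kl_divergence


def proxy_dirichlet_target(ensemble_probs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Summarize ensemble predictions into proxy Dirichlet concentrations [batch, classes]."""
    mean_probs = ensemble_probs.mean(dim=0)                           # mean class probabilities
    mean_log_probs = ensemble_probs.clamp_min(eps).log().mean(dim=0)  # mean log-probabilities
    num_classes = ensemble_probs.shape[-1]
    # Moment-style estimate of the Dirichlet precision from the gap between the
    # log of the mean probabilities and the mean log-probabilities (an assumed simplification).
    precision = (num_classes - 1) / (
        2.0 * (mean_probs * (mean_probs.clamp_min(eps).log() - mean_log_probs)).sum(-1, keepdim=True)
    )
    # Shift concentrations by 1 so every parameter stays above 1.
    return 1.0 + mean_probs * precision


def reverse_kl_loss(student_logits: torch.Tensor, target_alphas: torch.Tensor) -> torch.Tensor:
    """Reverse KL between the student Dirichlet and the proxy target Dirichlet."""
    student_alphas = 1.0 + student_logits.exp()  # strictly positive concentrations
    return kl_divergence(Dirichlet(student_alphas), Dirichlet(target_alphas)).mean()
```

In a training step, `proxy_dirichlet_target` would be computed from the frozen ensemble's predictions on the batch, and the student would be updated to minimize `reverse_kl_loss(student_logits, target_alphas)`.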
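The Experiment Setup and Hardware Specification rows together pin down a standard ImageNet-style schedule: a per-GPU batch of 256 on 8 GPUs (so B = 2048), SGD with momentum 0.9, a learning rate of 0.1 × B/256 = 0.8, 90 epochs, and a 10× decay every 30 epochs. The sketch below only wires up these stated hyperparameters; the student model and the training-loop body are placeholders, not taken from the paper.

```python
# Hedged sketch of the reported ImageNet training schedule; the model and the
# loop body are stand-ins for whatever student network is being distilled.
import torch

per_gpu_batch_size = 256
num_gpus = 8
global_batch = per_gpu_batch_size * num_gpus       # B = 2048
base_lr = 0.1 * global_batch / 256                 # linear scaling rule -> 0.8

model = torch.nn.Linear(2048, 1000)  # placeholder student; the paper's architecture is not asserted here

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
# Divide the learning rate by 10 every 30 epochs over the 90-epoch run.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... iterate over the training set, calling optimizer.step() once per batch ...
    scheduler.step()
```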