Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets

Authors: Max Ryabinin, Andrey Malinin, Mark Gales

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate EnD² via minimization of the reverse KL-divergence to the Proxy-Dirichlet target, which we refer to as PD-EnD². We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT 17 En-De, which feature 1000, 5000 and 40,000 classes, respectively. (A sketch of the reverse KL loss between Dirichlet distributions follows the table.)
Researcher Affiliation | Collaboration | Max Ryabinin (Yandex; HSE University, Moscow, Russia; mryabinin0@gmail.com), Andrey Malinin (Yandex; HSE University, Moscow, Russia; am969@yandex-team.ru), Mark Gales (Cambridge-ALTA Institute, Cambridge, United Kingdom; mjfg@eng.cam.ac.uk)
Pseudocode | No | The paper mentions algorithms and methods but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and training configurations are available at https://github.com/yandex-research/proxy-dirichlet-distillation.
Open Datasets | Yes | We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT 17 En-De, which feature 1000, 5000 and 40,000 classes, respectively.
Dataset Splits | Yes | To evaluate the predictive performance of the proposed method, we measure the classification accuracy and Expected Calibration Error (ECE) on the original ImageNet validation subset [21] and on a range of distributionally shifted datasets. (An ECE sketch follows the table.)
Hardware Specification | Yes | In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions software components like the 'Albumentations library', 'AdaDelta algorithm', and 'Adam optimizer', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Specifically, we train for 90 epochs using stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.1·B/256, where B is the per-device batch size multiplied by the number of GPUs. In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs. The learning rate is divided by 10 every 30 epochs. (A sketch of this schedule follows the table.)
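The "Research Type" row quotes the PD-EnD² objective: distilling an ensemble by minimising the reverse KL-divergence from the student's predicted Dirichlet to a Proxy-Dirichlet target. The PyTorch sketch below illustrates that loss under two assumptions: that "reverse" here means KL(student Dirichlet || target Dirichlet), as in reverse KL-divergence training of Prior Networks, and that the proxy target is formed from the ensemble's mean probabilities scaled by an estimated precision with a +1 offset. The function names, the precision handling, and this target construction are illustrative, not the authors' released code.

```python
import torch

def dirichlet_reverse_kl(student_alpha, target_beta):
    """KL( Dir(student_alpha) || Dir(target_beta) ), averaged over the batch.

    Both tensors have shape (batch_size, num_classes) with strictly positive
    entries; uses the closed-form KL between two Dirichlet distributions.
    """
    alpha0 = student_alpha.sum(dim=-1)
    beta0 = target_beta.sum(dim=-1)
    kl = (
        torch.lgamma(alpha0)
        - torch.lgamma(student_alpha).sum(dim=-1)
        - torch.lgamma(beta0)
        + torch.lgamma(target_beta).sum(dim=-1)
        + (
            (student_alpha - target_beta)
            * (torch.digamma(student_alpha) - torch.digamma(alpha0).unsqueeze(-1))
        ).sum(dim=-1)
    )
    return kl.mean()


def proxy_dirichlet_target(ensemble_probs, precision):
    """Illustrative Proxy-Dirichlet target built from ensemble predictions.

    ensemble_probs: (num_members, batch_size, num_classes) softmax outputs.
    precision: scalar or (batch_size,) estimate of the target concentration mass.
    The '+ 1.0' offset and the precision estimate are assumptions for this sketch.
    """
    mean_probs = ensemble_probs.mean(dim=0)  # (batch_size, num_classes)
    if torch.is_tensor(precision) and precision.dim() == 1:
        precision = precision.unsqueeze(-1)
    return 1.0 + mean_probs * precision
```

In practice, `student_alpha` would be obtained from the student network's logits (e.g. via an exponential or softplus parameterisation), and the precision estimate would be computed from the ensemble as described in the paper; both choices are outside the scope of this sketch.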
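The "Dataset Splits" row notes that accuracy and Expected Calibration Error (ECE) are reported on the ImageNet validation set. For reference, here is a minimal sketch of the standard ECE computation; the 15-bin equal-width scheme is a common default and is not taken from the paper.

```python
import torch

def expected_calibration_error(probs, labels, num_bins=15):
    """Standard ECE: bin predictions by confidence, average |accuracy - confidence|.

    probs: (num_examples, num_classes) predicted probabilities.
    labels: (num_examples,) integer class labels.
    """
    confidences, predictions = probs.max(dim=-1)
    correct = predictions.eq(labels).float()
    bin_edges = torch.linspace(0, 1, num_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.float().mean()
            ece += weight * (correct[in_bin].mean() - confidences[in_bin].mean()).abs()
    return ece
```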
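The "Experiment Setup" row describes a standard linear-scaling SGD recipe: 90 epochs, momentum 0.9, a learning rate of 0.1·B/256, and a tenfold decay every 30 epochs. The sketch below reconstructs only that schedule; the placeholder model and the omitted epoch body are assumptions rather than the released training configuration.

```python
import torch

# Values matching the quoted setup: 8 GPUs, per-GPU batch size 256.
num_gpus = 8
per_gpu_batch_size = 256
global_batch_size = per_gpu_batch_size * num_gpus   # B in the quote
base_lr = 0.1 * global_batch_size / 256             # linear scaling rule -> 0.8

# Placeholder module standing in for the distilled student network.
model = torch.nn.Linear(2048, 1000)

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
# Divide the learning rate by 10 every 30 epochs over a 90-epoch run.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one training epoch over the distillation data would go here ...
    scheduler.step()
```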