Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets

Authors: Max Ryabinin, Andrey Malinin, Mark Gales

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we evaluate EnD² via minimization of the reverse KL-divergence to the Proxy-Dirichlet target, which we refer to as PD-EnD². We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT 17 En-De, which feature 1000, 5000 and 40,000 classes, respectively. (A sketch of the reverse KL loss between Dirichlet distributions follows the table.)
Researcher Affiliation | Collaboration | Max Ryabinin (Yandex; HSE University, Moscow, Russia; mryabinin0@gmail.com), Andrey Malinin (Yandex; HSE University, Moscow, Russia; am969@yandex-team.ru), Mark Gales (Cambridge-ALTA Institute, Cambridge, United Kingdom; mjfg@eng.cam.ac.uk)
Pseudocode | No | The paper mentions algorithms and methods but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and training configurations are available at https://github.com/yandex-research/proxy-dirichlet-distillation.
Open Datasets | Yes | We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT 17 En-De, which feature 1000, 5000 and 40,000 classes, respectively.
Dataset Splits | Yes | To evaluate the predictive performance of the proposed method, we measure the classification accuracy and Expected Calibration Error (ECE) on the original ImageNet validation subset [21] and on a range of distributionally shifted datasets. (An ECE sketch follows the table.)
Hardware Specification | Yes | In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions software components like the 'Albumentations library', 'AdaDelta algorithm', and 'Adam optimizer', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Specifically, we train for 90 epochs using stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.1·B/256, where B is the per-device batch size multiplied by the number of GPUs. In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs. The learning rate is divided by 10 every 30 epochs. (A sketch of this schedule follows the table.)
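The "Research Type" row quotes the PD-EnD² objective: distilling an ensemble by minimising the reverse KL-divergence from the student's predicted Dirichlet to a Proxy-Dirichlet target. The PyTorch sketch below illustrates that loss under two assumptions: that "reverse" here means KL(student Dirichlet || target Dirichlet), as in reverse KL-divergence training of Prior Networks, and that the proxy target is formed from the ensemble's mean probabilities scaled by an estimated precision with a +1 offset. The function names, the precision handling, and this target construction are illustrative, not the authors' released code.

```python
import torch

def dirichlet_reverse_kl(student_alpha, target_beta):
    """KL( Dir(student_alpha) || Dir(target_beta) ), averaged over the batch.

    Both tensors have shape (batch_size, num_classes) with strictly positive
    entries; uses the closed-form KL between two Dirichlet distributions.
    """
    alpha0 = student_alpha.sum(dim=-1)
    beta0 = target_beta.sum(dim=-1)
    kl = (
        torch.lgamma(alpha0)
        - torch.lgamma(student_alpha).sum(dim=-1)
        - torch.lgamma(beta0)
        + torch.lgamma(target_beta).sum(dim=-1)
        + (
            (student_alpha - target_beta)
            * (torch.digamma(student_alpha) - torch.digamma(alpha0).unsqueeze(-1))
        ).sum(dim=-1)
    )
    return kl.mean()


def proxy_dirichlet_target(ensemble_probs, precision):
    """Illustrative Proxy-Dirichlet target built from ensemble predictions.

    ensemble_probs: (num_members, batch_size, num_classes) softmax outputs.
    precision: scalar or (batch_size,) estimate of the target concentration mass.
    The '+ 1.0' offset and the precision estimate are assumptions for this sketch.
    """
    mean_probs = ensemble_probs.mean(dim=0)  # (batch_size, num_classes)
    if torch.is_tensor(precision) and precision.dim() == 1:
        precision = precision.unsqueeze(-1)
    return 1.0 + mean_probs * precision
```

In practice, `student_alpha` would be obtained from the student network's logits (e.g. via an exponential or softplus parameterisation), and the precision estimate would be computed from the ensemble as described in the paper; both choices are outside the scope of this sketch.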
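The "Dataset Splits" row notes that accuracy and Expected Calibration Error (ECE) are reported on the ImageNet validation set. For reference, here is a minimal sketch of the standard ECE computation; the 15-bin equal-width scheme is a common default and is not taken from the paper.

```python
import torch

def expected_calibration_error(probs, labels, num_bins=15):
    """Standard ECE: bin predictions by confidence, average |accuracy - confidence|.

    probs: (num_examples, num_classes) predicted probabilities.
    labels: (num_examples,) integer class labels.
    """
    confidences, predictions = probs.max(dim=-1)
    correct = predictions.eq(labels).float()
    bin_edges = torch.linspace(0, 1, num_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.float().mean()
            ece += weight * (correct[in_bin].mean() - confidences[in_bin].mean()).abs()
    return ece
```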
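The "Experiment Setup" row describes a standard linear-scaling SGD recipe: 90 epochs, momentum 0.9, a learning rate of 0.1·B/256, and a tenfold decay every 30 epochs. The sketch below reconstructs only that schedule; the placeholder model and the omitted epoch body are assumptions rather than the released training configuration.

```python
import torch

# Values matching the quoted setup: 8 GPUs, per-GPU batch size 256.
num_gpus = 8
per_gpu_batch_size = 256
global_batch_size = per_gpu_batch_size * num_gpus   # B in the quote
base_lr = 0.1 * global_batch_size / 256             # linear scaling rule -> 0.8

# Placeholder module standing in for the distilled student network.
model = torch.nn.Linear(2048, 1000)

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
# Divide the learning rate by 10 every 30 epochs over a 90-epoch run.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one training epoch over the distillation data would go here ...
    scheduler.step()
```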