Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets
Authors: Max Ryabinin, Andrey Malinin, Mark Gales
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate EnD² via minimization of the reverse KL-divergence to Proxy Dirichlet target, which we refer to as PD-EnD². We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT'17 En-De, which feature 1000, 5000 and 40,000 classes, respectively. |
| Researcher Affiliation | Collaboration | Max Ryabinin Yandex, HSE University Moscow, Russia EMAIL Andrey Malinin Yandex, HSE University Moscow, Russia EMAIL Mark Gales Cambridge-ALTA Institute Cambridge, United Kingdom EMAIL |
| Pseudocode | No | The paper mentions algorithms and methods but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and training configurations are available at https://github.com/yandex-research/proxy-dirichlet-distillation. |
| Open Datasets | Yes | We apply PD-EnD² to ensembles of convolutional networks trained on the ImageNet [21] dataset, ensembles of VGG-Transformer [22] ASR models trained on LibriSpeech [23] and ensembles of Transformer-big [24] models trained on WMT'17 En-De, which feature 1000, 5000 and 40,000 classes, respectively. |
| Dataset Splits | Yes | To evaluate the predictive performance of the proposed method, we measure the classification accuracy and Expected Calibration Error (ECE) on the original ImageNet validation subset [21] and on a range of distributionally shifted datasets. |
| Hardware Specification | Yes | In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'Albumentations library', 'AdaDelta algorithm', and 'Adam optimizer', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Specifically, we train for 90 epochs using stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.1 × B/256, where B is the per-device batch size multiplied by the number of GPUs. In our experiments, we use a single-GPU batch size of 256 and 8 NVIDIA V100 GPUs. The learning rate is divided by 10 every 30 epochs. |
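
The learning-rate rule quoted in the Experiment Setup row can be worked through numerically. The sketch below is an illustration only and is not taken from the authors' repository: the function name `learning_rate` and its parameters are hypothetical, and it assumes zero-based epoch indexing, so that the first decay takes effect at epoch 30. With a per-GPU batch size of 256 on 8 GPUs, B = 2048 and the initial rate works out to 0.1 × 2048/256 = 0.8.

```python
def learning_rate(epoch, per_gpu_batch=256, num_gpus=8,
                  base_lr=0.1, decay_every=30, decay_factor=10.0):
    """Step schedule described in the quoted setup (hypothetical helper).

    Assumes epochs are counted from 0; the quoted text does not state
    the indexing convention.
    """
    # B is the per-device batch size multiplied by the number of GPUs.
    global_batch = per_gpu_batch * num_gpus
    # Initial rate: 0.1 * B / 256.
    initial_lr = base_lr * global_batch / 256
    # Divide by 10 once for every completed block of 30 epochs.
    return initial_lr / decay_factor ** (epoch // decay_every)


if __name__ == "__main__":
    # Print the rate at the start of each 30-epoch block of the 90-epoch run.
    for epoch in (0, 30, 60):
        print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is about 0.8 for epochs 0 to 29, 0.08 for epochs 30 to 59, and 0.008 for epochs 60 to 89.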