Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Knowledge Distillation of Uncertainty using Deep Latent Factor Model

Authors: Sehyun Park, Jongjin Lee, Yunseop Shin, Ilsang Ohn, Yongdai Kim

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we investigate Gaussian distillation by analyzing multiple benchmark datasets. We compare Gaussian distillation with existing baselines including the naive distillation (one-to-one distillation without sharing weights between student DNNs, small-Ens), Hydra [11] and BE [12] for regression and classification problems as well as fine-tuning of language models in view of uncertainty quantification. For classification, we also evaluate Proxy Dirichlet Distillation (Proxy-EnD²) [18] and Ensemble Distillation via Flow Matching (EDFM) [45]. In addition, we show that a pre-trained DLF outperforms its competitors for distribution shift problems.
Researcher Affiliation	Collaboration	Sehyun Park Department of Statistics Seoul National University EMAIL Jongjin Lee Samsung Research EMAIL Yunseop Shin Department of Statistics Seoul National University EMAIL Ilsang Ohn Department of Statistics Inha University EMAIL Yongdai Kim Department of Statistics Seoul National University EMAIL
Pseudocode	Yes	Algorithm 1: EM algorithm for the univariate DLF model
Open Source Code	Yes	The source code of DLF is publicly available at https://github.com/sehyun1094/DLF
Open Datasets	Yes	Datasets: We analyze six benchmark datasets from the UCI repository [46] including Boston housing, Concrete, Energy, Wine, Power Plant, and Kin8nm... Datasets: CIFAR-10 and CIFAR-100 consist of 50,000 training and 10,000 test images... Datasets: We analyze three GLUE [50] and SuperGLUE [51] sub-tasks: RTE, MRPC, and WiC.
Dataset Splits	Yes	Each dataset is randomly split into 90% training and 10% testing... In this experiment, the training data are further split into 80% training and 20% validation... The entire dataset is partitioned into three disjoint subsets: D = D_train^teacher ∪ D_train^new ∪ D_test, with a fixed ratio of 4.5 : 4.5 : 1.
Hardware Specification	Yes	All our experiments are done through Python 3.9.16 with Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz, NVIDIA TITAN Xp GPU and 128GB RAM.
Software Dependencies	Yes	All our experiments are done through Python 3.9.16 with Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz, NVIDIA TITAN Xp GPU and 128GB RAM. The Adam [53] is used for the optimization.
Experiment Setup	Yes	We obtain 50 teacher models of DNNs with two hidden layers and 100 nodes at each layer... The architecture of student models comprises of an one-hidden-layer MLP with 50 units. Training lasts 200 epochs on a single GPU using SGD with Nesterov momentum of 0.9, weight decay of 5e-4, and batch size of 128. A one-cycle cosine annealing schedule with a five-epoch linear warm-up (from 0.001 to 0.1) is employed.
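
The split ratios quoted in the Dataset Splits row can be illustrated with a small sketch. This is not the authors' code: the helper name `split_indices`, the seed, and the rounding rule (floor for every part except the last, which takes the remainder) are assumptions made here for illustration only.

```python
import random


def split_indices(n_samples, ratios, seed=0):
    """Shuffle indices 0..n_samples-1 and cut them into disjoint parts.

    `ratios` gives the relative part sizes, e.g. (9, 1) or (4.5, 4.5, 1).
    Hypothetical helper: rounding and seeding are assumptions, not taken
    from the paper.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    total = sum(ratios)
    parts, start = [], 0
    for ratio in ratios[:-1]:
        size = int(n_samples * ratio / total)  # floor of the part size
        parts.append(indices[start:start + size])
        start += size
    parts.append(indices[start:])  # last part takes the remainder
    return parts


# 90% training / 10% testing, then 80% / 20% of the training part.
train, test = split_indices(1000, (9, 1))
fit_positions, val_positions = split_indices(len(train), (8, 2), seed=1)
fit = [train[i] for i in fit_positions]
val = [train[i] for i in val_positions]

# Three disjoint subsets with ratio 4.5 : 4.5 : 1.
teacher, new, held_out = split_indices(1000, (4.5, 4.5, 1))
```

Because the second call splits positions within `train` rather than raw sample indices, the validation set is always carved out of the training portion and never overlaps the test set.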
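
The learning-rate schedule quoted in the Experiment Setup row (five-epoch linear warm-up from 0.001 to 0.1, followed by a single cosine annealing cycle over 200 epochs) can be written as a plain function of the epoch. This is a minimal sketch under stated assumptions: the excerpt does not give the final learning rate or say whether the rate is updated per epoch or per step, so the sketch assumes per-epoch updates and a decay to 0. The name `learning_rate` and all parameter names are hypothetical.

```python
import math


def learning_rate(epoch, total_epochs=200, warmup_epochs=5,
                  lr_start=0.001, lr_peak=0.1, lr_end=0.0):
    """Learning rate at the start of `epoch` (0-indexed).

    Assumptions not stated in the source: per-epoch updates and a final
    rate of `lr_end` = 0.0.
    """
    if epoch < warmup_epochs:
        # Linear warm-up from lr_start to lr_peak.
        fraction = epoch / warmup_epochs
        return lr_start + fraction * (lr_peak - lr_start)
    # One cosine half-cycle from lr_peak down to lr_end.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(epoch) for epoch in range(200)]
```

In a training loop, the value for the current epoch would be assigned to the optimizer's learning rate before that epoch's updates; the other quoted settings (SGD with Nesterov momentum 0.9, weight decay 5e-4, batch size 128) are independent of this function.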