Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model

Authors: Kaito Takanami, Takashi Takahashi, Ayaka Sakata

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our findings reveal that the primary driver of SD's performance improvement is denoising through hard pseudo-labels, namely discrete labels generated from the model's own predictions, with the most notable gains observed in moderately sized datasets. We also identify two practical heuristics to enhance SD: early stopping that limits the number of stages, which is broadly effective, and bias parameter fixing, which helps under label imbalance. To empirically validate our theoretical findings derived from our toy model, we conduct additional experiments on CIFAR-10 classification using pretrained ResNet backbone.
Researcher Affiliation | Academia | Kaito Takanami, Department of Physics, Graduate School of Science, The University of Tokyo, Tokyo, Japan; Center for Interdisciplinary AI and Data Science, Ochanomizu University, Tokyo, Japan. Takashi Takahashi, Institute for Physics of Intelligence, The University of Tokyo, Tokyo, Japan; RIKEN center for AIP. Ayaka Sakata, Department of Information Science, Ochanomizu University, Tokyo, Japan; RIKEN center for AIP.
Pseudocode | Yes | Algorithm 1: UpdateQhatColumn(t): Self-consistent update of column t (past columns are fixed). Input: results up to time $t-1$: $\{ z_{1:t-1},\ h_{1:t-1},\ G[s,r] = dz_r/dh_s,\ H[s,r] = dh_r/dh_s \ (s \le r \le t-1) \}$ (153). Output: results at time $t$: $\{ G[s,t] = dz_t/dh_s,\ H[s,t] = dh_t/dh_s \ (s \le t),\ Q[s,t] \ (s \le t) \}$ (154).
Open Source Code | Yes | Reproducibility: The codes to reproduce some of our results are available at https://github.com/taka255/self-distillation-analysis.
Open Datasets | Yes | To empirically validate our theoretical findings derived from our toy model, we conduct additional experiments on CIFAR-10 classification using pretrained ResNet backbone. ... fine-tune only the final layer of a ResNet pretrained on IMAGENET1K_V2 (maintainers and contributors [2016]; BSD 3-Clause New License) with L2 regularization on noisy CIFAR-10 (cat vs. dog) (Krizhevsky et al. [2009]; MIT License)
Dataset Splits | No | From the pool of noisy embeddings, we uniformly draw M samples (with a fixed class balance when desired) to form the actual training set used in the SD experiments.
Hardware Specification | Yes | All experiments were executed on CPU workers equipped with an AMD EPYC 9654 processor and 512 GB of main memory.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers. It mentions TorchVision and PyTorch implicitly for ResNet and ImageNet usage but no versions.
Experiment Setup | No | We fine-tune only the final layer of a ResNet pretrained on IMAGENET1K_V2 (maintainers and contributors [2016]; BSD 3-Clause New License) with L2 regularization on noisy CIFAR-10 (cat vs. dog) (Krizhevsky et al. [2009]; MIT License)... The key hyperparameters, λ and β, are selected by minimizing the estimated generalization error on the test embeddings via Bayesian optimization.
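
Illustrative sketch (not taken from the paper or its repository): the excerpts above describe multi-stage self-distillation (SD) in which each stage is retrained on hard pseudo-labels produced by the previous stage, with L2 regularization, on a noisy two-class Gaussian mixture. The Python code below is a minimal sketch of that loop under simplifying assumptions. The function names (`make_noisy_gmm`, `fit_logistic`, `predict_hard`, `self_distill`), the cluster geometry, the plain gradient-descent optimizer, and all numeric settings are hypothetical choices made for illustration. The sketch uses a single fixed regularization strength `lam` and does not model the paper's β hyperparameter, Bayesian optimization, or bias parameter fixing.

```python
import numpy as np


def make_noisy_gmm(n, d, flip_rate, rng):
    """Draw a two-cluster Gaussian mixture and flip each label with probability flip_rate.

    The cluster direction and separation are hypothetical choices for this sketch.
    """
    direction = np.ones(d) / np.sqrt(d)           # unit-norm cluster direction
    y_true = rng.integers(0, 2, size=n)           # clean labels in {0, 1}
    signs = 2 * y_true - 1                        # map labels to {-1, +1}
    x = 2.0 * signs[:, None] * direction + rng.normal(size=(n, d))
    flipped = rng.random(n) < flip_rate           # symmetric label noise
    y_noisy = np.where(flipped, 1 - y_true, y_true)
    return x, y_true, y_noisy


def fit_logistic(x, targets, lam, lr=0.1, steps=500):
    """L2-regularized logistic regression fitted by plain gradient descent."""
    n, d = x.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        logits = np.clip(x @ w + b, -30.0, 30.0)  # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (x.T @ (p - targets) / n + lam * w)
        b -= lr * np.mean(p - targets)            # the bias is not regularized
    return w, b


def predict_hard(x, w, b):
    """Hard (discrete) labels from the sign of the linear score."""
    return (x @ w + b > 0).astype(int)


def self_distill(x, y_noisy, stages, lam):
    """Train stage 0 on noisy labels, then retrain `stages` times on hard pseudo-labels.

    Returns one (w, b) pair per stage, so the list has stages + 1 entries.
    Limiting `stages` plays the role of early stopping over SD stages.
    """
    models = []
    targets = y_noisy.astype(float)
    for _ in range(stages + 1):
        w, b = fit_logistic(x, targets, lam)
        models.append((w, b))
        targets = predict_hard(x, w, b).astype(float)  # pseudo-labels for the next stage
    return models


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y_true, y_noisy = make_noisy_gmm(n=400, d=5, flip_rate=0.3, rng=rng)
    for stage, (w, b) in enumerate(self_distill(x, y_noisy, stages=2, lam=0.01)):
        acc = np.mean(predict_hard(x, w, b) == y_true)
        print(f"stage {stage}: agreement with clean labels = {acc:.3f}")
```

Running the script prints, for each stage, how often the hard predictions agree with the clean labels on the training inputs. Whether later stages help depends on the noise rate, sample size, and regularization, so the sketch makes no claim about the size or direction of the change.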