Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Rethinking Self-Distillation: Label Averaging and Enhanced Soft Label Refinement with Partial Labels
Authors: Hyeonsu Jeong, Hye Won Chung
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our findings are supported by experiments on synthetic and real datasets (Figure 3 and Section 6). ... In this section, we validate our theoretical findings through experiments on real datasets. |
| Researcher Affiliation | Academia | Hyeonsu Jeong & Hye Won Chung School of Electrical Engineering Korea Advanced Institute of Science and Technology (KAIST) Daejeon, South Korea EMAIL |
| Pseudocode | Yes | Algorithm 1: Optimal Output Calculation by Numerical Method |
| Open Source Code | Yes | Our code is publicly available at https://github.com/Hyeonsu-Jeong/Self-PLL. |
| Open Datasets | Yes | We conduct experiments on six multi-class image classification benchmarks: CIFAR-100 (Krizhevsky et al., 2009), Caltech-101/256 (Griffin et al., 2007), Flowers-102 (Nilsback & Zisserman, 2008), Food-101 (Bossard et al., 2014), and Stanford Cars (Krause et al., 2013), utilizing the PyTorch torchvision library. |
| Dataset Splits | Yes | For each category, there are about 40 to 800 samples, with an average of 50 samples per category. We removed the background category and divided the dataset into training and validation sets using an 8:2 ratio. ... Similar to the Caltech-101 dataset, we removed the clutter category and divided the dataset into training and validation sets using an 8:2 ratio. ... Since the test set of the Flowers-102 dataset is larger than the training set, we swapped the training and test sets for use. |
| Hardware Specification | Yes | Our neural networks are trained using multiple NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions using the PyTorch torchvision library and the SGD optimizer, but does not specify their version numbers. It also mentions using Generalized Cross Entropy (GCE) loss with hyperparameter q = 0.7, but this is a method, not a software dependency. |
| Experiment Setup | Yes | We perform a grid search over learning rates in the set {0.1, 0.05, 0.01, 0.005, 0.001}. Each model is trained for 200 epochs, employing the SGD optimizer with a momentum value of 0.9. In our experiments, we observe that using CE loss with the PLL student model often leads to instability during training. The PLL student model trains with a set of candidate labels for each sample with equal weights: in our case, the top two labels with weights of 1/2 each. Using CE loss with equally weighted candidate labels can cause instability since the model may converge incorrectly when the candidate set includes incorrect labels. Hence, for the stable convergence of the PLL student model, we use Generalized Cross Entropy (GCE) (Zhang & Sabuncu, 2018) loss with the hyperparameter q = 0.7. |
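The 8:2 train/validation split quoted under Dataset Splits (used for Caltech-101/256 after removing the background/clutter category) can be sketched as a simple shuffled partition. This is an illustrative reconstruction, not the authors' code; the function name, seed, and 0.8 cut point are assumptions.

```python
import random

def split_8_2(samples, seed=0):
    """Shuffle indices and partition into 80% train / 20% validation,
    mirroring the 8:2 ratio described in the reproducibility report."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(0.8 * len(samples))  # first 80% of shuffled indices -> train
    train = [samples[i] for i in idx[:cut]]
    val = [samples[i] for i in idx[cut:]]
    return train, val

train, val = split_8_2(list(range(1000)))
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing learning rates in the grid search described above.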
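The Generalized Cross Entropy loss cited in the Experiment Setup row has a simple closed form, L_q(p, y) = (1 - p_y^q) / q (Zhang & Sabuncu, 2018), recovering cross-entropy as q → 0 and a MAE-like loss at q = 1. Below is a minimal sketch of this loss and its equal-weight extension over a candidate label set, as the PLL student uses (top-2 labels with weight 1/2 each, q = 0.7); the function names and the plain-list probability input are assumptions for illustration, not the paper's implementation.

```python
def gce_loss(probs, target, q=0.7):
    """Generalized Cross Entropy: L_q = (1 - p_y^q) / q.
    probs: softmax output (list of per-class probabilities)."""
    return (1.0 - probs[target] ** q) / q

def pll_gce_loss(probs, candidates, q=0.7):
    """GCE averaged over a candidate label set with equal weights,
    e.g. the top-2 labels with weight 1/2 each for the PLL student."""
    w = 1.0 / len(candidates)
    return sum(w * gce_loss(probs, c, q) for c in candidates)

probs = [0.7, 0.2, 0.1]            # softmax outputs for 3 classes
single = gce_loss(probs, 0)        # supervised GCE on the true label
partial = pll_gce_loss(probs, [0, 1])  # top-2 candidate set
```

Because (1 - p^q)/q grows more slowly than -log p as p → 0, GCE down-weights samples whose candidate set contains a wrong label, which is the stability argument quoted above.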