Understanding Self-Distillation in the Presence of Label Noise

Authors: Rudrajit Das, Sujay Sanghavi

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we theoretically characterize the effect of SD in two supervised learning problems with noisy labels. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of ξ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that ξ > 1 works better than ξ ≤ 1 even with the cross-entropy loss for several classification datasets when 50% or 30% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher (w.r.t. accuracy). To our knowledge, this is the first result of its kind for the cross-entropy loss.
Researcher Affiliation | Academia | UT Austin. Correspondence to: Rudrajit Das <rdas@utexas.edu>.
Pseudocode | No | The paper describes algorithms and derivations in prose and mathematical equations but does not present any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links about open-source code for its methodology.
Open Datasets | Yes | We consider multi-class classification with the cross-entropy loss on several vision datasets available in PyTorch's torchvision, namely, CIFAR-100 with 100 classes, Caltech-256 with 257 classes, Food-101 with 101 classes, Stanford Cars with 196 classes and Flowers-102 with 102 classes.
Dataset Splits | Yes | Since Caltech-256 does not have any default train/test split, we pick 25k random images from the full dataset to form the training set, while the remaining images form the test set. For all datasets, we train a softmax layer on top of a pretrained ResNet-34/VGG-16 model on ImageNet which is kept fixed, i.e., we do linear probing on ResNet-34/VGG-16. No data augmentation is involved.
Hardware Specification | No | The paper mentions training models but does not specify any particular hardware, such as GPU or CPU models, used for the experiments.
Software Dependencies | No | The paper mentions using "PyTorch's torchvision" but does not specify versions for PyTorch or any other software dependencies.
Experiment Setup | Yes | We use SGD with momentum = 0.9 and batch size = 128 for training. Since we are training only the softmax layer (i.e., doing logistic regression), we use an exponentially decaying learning rate scheme with decay parameter = 0.98 (for every epoch) and the initial learning rate is tuned over {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}. The maximum number of epochs is 200.
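The Research Type row above summarizes the paper's analysis of self-distillation (SD) for regularized linear regression with an imitation parameter ξ. The following is a minimal sketch of that setting, assuming the usual SD formulation in which the student's regression target mixes the teacher's predictions (weight ξ) with the given noisy labels (weight 1 − ξ); the dimensions, noise level, and regularization strength below are illustrative choices, not values from the paper.

```python
# Hypothetical sketch of self-distillation (SD) for regularized linear regression
# with label noise. xi is the imitation parameter: the student's target mixes the
# teacher's predictions (weight xi) with the noisy labels (weight 1 - xi).
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: argmin_w ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d, lam, xi = 200, 20, 1.0, 1.5           # xi > 1 is the regime highlighted in the paper
w_star = rng.normal(size=d)                  # ground-truth parameter
X = rng.normal(size=(n, d))
y_noisy = X @ w_star + 2.0 * rng.normal(size=n)   # labels with heavy additive noise

# Teacher: ridge regression fit directly on the noisy labels.
w_teacher = ridge_fit(X, y_noisy, lam)

# Student: ridge regression on mixed targets. With squared loss this is equivalent
# to weighting the teacher-imitation and noisy-label loss terms by xi and 1 - xi.
y_mixed = xi * (X @ w_teacher) + (1.0 - xi) * y_noisy
w_student = ridge_fit(X, y_mixed, lam)

print("teacher parameter error:", np.linalg.norm(w_teacher - w_star))
print("student parameter error:", np.linalg.norm(w_student - w_star))
```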
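The Open Datasets and Dataset Splits rows describe loading the benchmark datasets from torchvision and linear probing on a frozen ImageNet-pretrained ResNet-34. Below is a sketch of that setup under stated assumptions: the 224x224 resize and normalization constants are mine (the paper only says no data augmentation is used), and CIFAR-100 stands in for the other datasets.

```python
# Hypothetical linear-probing setup: torchvision dataset, frozen ImageNet-pretrained
# ResNet-34 backbone, and a single linear (softmax) layer trained on top.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# CIFAR-100 shown as one example; Caltech-256, Food-101, Stanford Cars and
# Flowers-102 are loaded analogously from torchvision.datasets.
train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR100(root="./data", train=False, download=True, transform=transform)

# Caltech-256 has no default split; the paper picks 25k random images for training:
# caltech = datasets.Caltech256(root="./data", download=True, transform=transform)
# train_idx = torch.randperm(len(caltech))[:25000]

backbone = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()          # expose the 512-d penultimate features
for p in backbone.parameters():
    p.requires_grad = False          # backbone is kept fixed (linear probing)
backbone.eval()

num_classes = 100                    # 100 for CIFAR-100, 257 for Caltech-256, etc.
probe = nn.Linear(512, num_classes)  # the only trainable layer (softmax head)
```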
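Continuing the sketch above, the Experiment Setup row specifies SGD with momentum 0.9, batch size 128, per-epoch exponential learning-rate decay with factor 0.98, an initial learning rate tuned over a fixed grid, and at most 200 epochs. The loop below mirrors that configuration; how the initial learning rate is chosen from the grid and how label noise is injected are not shown here and would follow the paper.

```python
# Hypothetical training loop for the frozen-backbone probe defined above.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone, probe = backbone.to(device), probe.to(device)
loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

lr_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]   # grid the initial LR is tuned over
init_lr = lr_grid[2]                             # one candidate; selection not shown

optimizer = torch.optim.SGD(probe.parameters(), lr=init_lr, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):                         # maximum number of epochs
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            feats = backbone(images)             # frozen features, no gradients
        loss = criterion(probe(feats), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                             # decay the learning rate every epoch
```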