Understanding Self-Distillation in the Presence of Label Noise
Authors: Rudrajit Das, Sujay Sanghavi
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we theoretically characterize the effect of SD in two supervised learning problems with noisy labels. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of ξ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that ξ > 1 works better than ξ ≤ 1 even with the cross-entropy loss for several classification datasets when 50% or 30% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher (w.r.t. accuracy). To our knowledge, this is the first result of its kind for the cross-entropy loss. (A sketch of this ξ-weighted distillation objective is given after the table.) |
| Researcher Affiliation | Academia | UT Austin. Correspondence to: Rudrajit Das <rdas@utexas.edu>. |
| Pseudocode | No | The paper describes algorithms and derivations in prose and mathematical equations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements or links about open-source code for their methodology. |
| Open Datasets | Yes | We consider multi-class classification with the cross-entropy loss on several vision datasets available in PyTorch's torchvision, namely, CIFAR-100 with 100 classes, Caltech-256 with 257 classes, Food-101 with 101 classes, Stanford Cars with 196 classes and Flowers-102 with 102 classes. |
| Dataset Splits | Yes | Since Caltech-256 does not have any default train/test split, we pick 25k random images from the full dataset to form the training set, while the remaining images form the test set. For all datasets, we train a softmax layer on top of a pretrained ResNet-34/VGG-16 model on ImageNet which is kept fixed, i.e., we do linear probing on ResNet-34/VGG-16. No data augmentation is involved. |
| Hardware Specification | No | The paper mentions training models but does not specify any particular hardware like GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions using "PyTorch's torchvision" but does not specify versions for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We use SGD with momentum = 0.9 and batch size = 128 for training. Since we are training only the softmax layer (i.e., doing logistic regression), we use an exponentially decaying learning rate scheme with decay parameter = 0.98 (for every epoch) and the initial learning rate is tuned over {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}. The maximum number of epochs is 200. (A PyTorch sketch of this linear-probing setup follows the table.) |
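
To make the ξ-weighted self-distillation discussed in the Research Type row concrete, the following is a minimal sketch for the regularized linear regression case. It assumes the student is refit on the mixed target ξ·(teacher predictions) + (1 − ξ)·(noisy labels), so that ξ > 1 places a negative weight on the noisy labels; the function names, regularization value, and toy noise level are illustrative and not taken from the paper.

```python
# Minimal sketch (not the authors' code) of xi-weighted self-distillation for
# ridge regression with noisy labels. Assumption: the student target is
# xi * (teacher predictions) + (1 - xi) * (noisy labels).
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def self_distill(X, y_noisy, lam, xi):
    """One round of self-distillation with imitation parameter xi."""
    theta_teacher = ridge_fit(X, y_noisy, lam)           # teacher: fit noisy labels
    y_teacher = X @ theta_teacher                         # teacher's predictions
    y_student = xi * y_teacher + (1.0 - xi) * y_noisy     # xi > 1 down-weights noise
    return ridge_fit(X, y_student, lam)                   # student: fit mixed target

# Toy usage: Gaussian features with heavy label noise; compare xi = 1 (plain refit)
# against xi > 1 by the error in recovering the ground-truth parameter.
rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y_noisy = X @ theta_star + 5.0 * rng.standard_normal(n)   # high-noise regime
for xi in (1.0, 1.5):
    err = np.linalg.norm(self_distill(X, y_noisy, lam=10.0, xi=xi) - theta_star)
    print(f"xi = {xi}: parameter error = {err:.3f}")
```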
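
The Dataset Splits and Experiment Setup rows together describe linear probing on a frozen ImageNet-pretrained backbone. Below is a minimal PyTorch/torchvision sketch of that setup under the reported hyperparameters (SGD with momentum 0.9, batch size 128, per-epoch exponential learning-rate decay of 0.98, up to 200 epochs); the dataset loading, label corruption, and the initial-learning-rate grid search over {0.001, ..., 0.5} are omitted.

```python
# Minimal sketch (not the authors' release) of linear probing: a frozen
# ImageNet-pretrained ResNet-34 from torchvision with a single trainable
# softmax (linear) head, trained with SGD and per-epoch exponential LR decay.
import torch
import torch.nn as nn
import torchvision

def build_linear_probe(num_classes: int) -> nn.Module:
    model = torchvision.models.resnet34(weights="IMAGENET1K_V1")
    for p in model.parameters():
        p.requires_grad = False                                # freeze the backbone
    model.fc = nn.Linear(model.fc.in_features, num_classes)    # trainable head only
    return model

model = build_linear_probe(num_classes=100)                    # e.g. CIFAR-100
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.05, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.98)
criterion = nn.CrossEntropyLoss()

# train_loader is assumed: a DataLoader with batch_size=128 over (possibly noisy) labels.
# for epoch in range(200):
#     for images, labels in train_loader:
#         optimizer.zero_grad()
#         criterion(model(images), labels).backward()
#         optimizer.step()
#     scheduler.step()                                         # decay the LR once per epoch
```

Only the head's parameters are passed to the optimizer, which matches the "softmax layer on top of a pretrained model which is kept fixed" description in the table.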