Investigating Why Contrastive Learning Benefits Robustness against Label Noise

Authors: Yihao Xue, Kyle Whitecross, Baharan Mirzasoleiman

ICML 2022

Reproducibility assessment: each entry below gives the variable, the assessed result, and the LLM's supporting response.
Research Type: Experimental. "Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness, by having: (i) one prominent singular value corresponding to each sub-class in the data, and significantly smaller remaining singular values; and (ii) a large alignment between the prominent singular vectors and the clean labels of each sub-class. The above properties enable a linear layer trained on such representations to effectively learn the clean labels without overfitting the noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average of 27.18% and 15.58% increase in accuracy on CIFAR-10 and CIFAR-100 with 80% symmetric noisy labels, and 4.11% increase in accuracy on WebVision."
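To make the claimed spectral structure concrete, the following is a minimal NumPy sketch (not from the paper; `reps` and `labels` are hypothetical placeholders) of how one could check both properties on a representation matrix: dominance of the top-k singular values, one per sub-class, and alignment of the prominent singular vectors with the clean sub-class labels.

```python
import numpy as np

def spectral_diagnostics(reps, labels):
    """Check the two properties described above on a representation matrix.

    reps:   (n_samples, dim) matrix of learned representations.
    labels: (n_samples,) array of clean sub-class labels in {0, ..., k-1}.
    """
    k = len(np.unique(labels))

    # Property (i): the top-k singular values should dominate the rest.
    U, S, _ = np.linalg.svd(reps, full_matrices=False)
    print("top-k singular values:", S[:k])
    print("residual spectral mass:", S[k:].sum() / S.sum())

    # Property (ii): each sub-class indicator vector should lie mostly in the
    # span of the top-k left singular vectors (alignment close to 1).
    for c in range(k):
        y_c = (labels == c).astype(float)
        y_c /= np.linalg.norm(y_c)
        alignment = np.linalg.norm(U[:, :k].T @ y_c)
        print(f"sub-class {c}: alignment = {alignment:.3f}")
```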
Researcher Affiliation: Academia. "Department of Computer Science, University of California, Los Angeles, CA 90095, USA. Correspondence to: Yihao Xue <yihaoxue@g.ucla.edu>."
Pseudocode: No. The paper does not contain any explicit pseudocode or algorithm blocks.
Open Source Code: No. The paper mentions "SimCLR. https://github.com/spijkervet/simclr.", but this is a reference to a method they use, not a code release for the methodology described in the paper.
Open Datasets: Yes. "We conduct extensive experiments on noisy CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), where noisy labels are generated by randomly flipping the original ones, and the mini WebVision dataset (Li et al., 2017), which is a benchmark consisting of images crawled from websites, containing real-world noisy labels."
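For context, the symmetric noise described here is typically generated by flipping a fixed fraction of training labels uniformly at random. Below is a minimal sketch under that assumption (conventions differ on whether a flip may keep the original label; this version always changes it, and the function name is illustrative):

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate=0.8, num_classes=10, seed=0):
    """Flip a `noise_rate` fraction of labels uniformly to a different class."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    n = len(noisy)
    # Select which examples get a corrupted label.
    flip_idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in flip_idx:
        # Pick uniformly among the other num_classes - 1 classes.
        candidates = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(candidates)
    return noisy
```

With `noise_rate=0.8` and `num_classes=10`, this corresponds to the 80% symmetric noise setting on CIFAR-10 quoted in the abstract.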
Dataset Splits: No. The paper specifies the number of training and test images for CIFAR-10/100 (50,000 training, 10,000 test) and mentions using training data for analysis, but it does not explicitly describe a separate validation split, its size, or how it was constructed for any dataset used in the experiments.
Hardware Specification: Yes. "Our method was developed using PyTorch (Paszke et al., 2017). We use 1 Nvidia A40 for all experiments."
Software Dependencies: No. The paper mentions "PyTorch (Paszke et al., 2017)", "Adam optimizer (Kingma & Ba, 2014)", and "SGD optimizer", but it does not specify version numbers for these software components or libraries, which is required for reproducibility.
Experiment Setup: Yes. "In our experiments, we first pre-train ResNet-32 (He et al., 2016) using SimCLR (Chen et al., 2020; SimCLR) for 1000 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 3×10^-4, a weight decay of 1×10^-6 and a batch size of 128. ... For ELR, we use β = 0.7 for the temporal ensembling parameter, and λ = 3 for the regularization strength. For mixup, we use a mixup strength of α = 1. For Crust, we choose a coreset ratio of 0.5. ... We train Inception-ResNet-v2 (Szegedy et al., 2017) for 120 epochs with a starting learning rate of 0.02, which we anneal by a factor of 0.01 at epochs 40 and 80. We use the SGD optimizer with a weight decay of 1×10^-3, and a minibatch size of 32."
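As a concrete reading of these hyperparameters, here is a minimal PyTorch sketch of the two optimization setups quoted above; the placeholder modules stand in for ResNet-32 and Inception-ResNet-v2, whose definitions are omitted:

```python
import torch
import torch.nn as nn

# Placeholders; substitute real ResNet-32 / Inception-ResNet-v2 implementations.
encoder = nn.Linear(3 * 32 * 32, 64)
classifier = nn.Linear(3 * 299 * 299, 50)

# SimCLR pre-training: Adam, lr 3e-4, weight decay 1e-6 (batch size 128, 1000 epochs).
pretrain_opt = torch.optim.Adam(encoder.parameters(), lr=3e-4, weight_decay=1e-6)

# WebVision training: SGD, lr 0.02, weight decay 1e-3 (batch size 32, 120 epochs),
# with the learning rate multiplied by 0.01 at epochs 40 and 80.
finetune_opt = torch.optim.SGD(classifier.parameters(), lr=0.02, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    finetune_opt, milestones=[40, 80], gamma=0.01
)

# Mixup with strength alpha = 1 draws mixing weights from Beta(1, 1), i.e. Uniform(0, 1).
lam = float(torch.distributions.Beta(1.0, 1.0).sample())
```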