Investigating Why Contrastive Learning Benefits Robustness against Label Noise
Authors: Yihao Xue, Kyle Whitecross, Baharan Mirzasoleiman
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite its empirical success, the theoretical understanding of the effect of contrastive learning on boosting robustness is very limited. In this work, we rigorously prove that the representation matrix learned by contrastive learning boosts robustness, by having: (i) one prominent singular value corresponding to each sub-class in the data, and significantly smaller remaining singular values; and (ii) a large alignment between the prominent singular vectors and the clean labels of each sub-class. The above properties enable a linear layer trained on such representations to effectively learn the clean labels without overfitting the noise. We further show that the low-rank structure of the Jacobian of deep networks pre-trained with contrastive learning allows them to achieve a superior performance initially, when fine-tuned on noisy labels. Finally, we demonstrate that the initial robustness provided by contrastive learning enables robust training methods to achieve state-of-the-art performance under extreme noise levels, e.g., an average of 27.18% and 15.58% increase in accuracy on CIFAR-10 and CIFAR-100 with 80% symmetric noisy labels, and 4.11% increase in accuracy on WebVision. (A numerical sketch of these spectral properties appears after this table.) |
| Researcher Affiliation | Academia | Department of Computer Science, University of California, Los Angeles, CA 90095, USA. Correspondence to: Yihao Xue <yihaoxue@g.ucla.edu>. |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper cites "SimCLR. https://github.com/spijkervet/simclr.", which is a third-party implementation of a method the authors use, not a release of their own code for the methodology described in the paper. |
| Open Datasets | Yes | We conduct extensive experiments on noisy CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009), where noisy labels are generated by randomly flipping the original ones, and the mini WebVision dataset (Li et al., 2017), which is a benchmark consisting of images crawled from websites, containing real-world noisy labels. (A sketch of this noise-injection scheme follows the table.) |
| Dataset Splits | No | The paper specifies the number of training and test images for CIFAR-10/100 (50,000 training, 10,000 test) and mentions using training data for analysis, but does not explicitly describe a separate validation split, its size, or how it was constructed for any of the datasets used in the experiments. |
| Hardware Specification | Yes | Our method was developed using PyTorch (Paszke et al., 2017). We use 1 Nvidia A40 for all experiments. |
| Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2017)", "Adam optimizer (Kingma & Ba, 2014)", and "SGD optimizer", but it does not specify version numbers for these software components or libraries, which is required for reproducibility. |
| Experiment Setup | Yes | In our experiments, we first pre-train ResNet-32 (He et al., 2016) using SimCLR (Chen et al., 2020; SimCLR) for 1000 epochs using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 3 × 10⁻⁴, a weight decay of 1 × 10⁻⁶ and a batch size of 128. ... For ELR, we use β = 0.7 for the temporal ensembling parameter, and λ = 3 for the regularization strength. For mixup, we use a mixup strength of α = 1. For Crust, we choose a coreset ratio of 0.5. ... We train Inception-ResNet-v2 (Szegedy et al., 2017) for 120 epochs with a starting learning rate of 0.02, which we anneal by a factor of 0.01 at epochs 40 and 80. We use the SGD optimizer with a weight decay of 1 × 10⁻³, and a minibatch size of 32. (See the optimizer sketch after the table.) |
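
The spectral claims quoted in the Research Type row lend themselves to a quick numerical check. Below is a minimal sketch, assuming a representation matrix `Z` of shape `(n_samples, n_dims)` and clean labels in `{0, ..., k-1}` are available; the function `spectral_properties` and the toy data at the bottom are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def spectral_properties(Z: np.ndarray, clean_labels: np.ndarray, k: int):
    """Check (i) k prominent singular values and (ii) alignment of the
    top-k singular vectors with the clean labels, as claimed in the paper."""
    # (i) Singular spectrum of the representation matrix: the paper predicts
    # one prominent singular value per sub-class and a sharp drop after k.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    spectral_gap = S[k - 1] / S[k]  # large ratio => low effective rank

    # (ii) Alignment: project normalized one-hot clean-label indicators
    # onto the span of the top-k left singular vectors.
    Y = np.eye(k)[clean_labels]        # (n, k) one-hot indicator matrix
    Y = Y / np.linalg.norm(Y, axis=0)  # unit-norm class columns
    alignment = np.linalg.norm(U[:, :k].T @ Y)  # near sqrt(k) when aligned
    return S, spectral_gap, alignment

# Toy usage: two well-separated clusters yield a large gap and high alignment.
rng = np.random.default_rng(0)
n, d, k = 200, 5, 2
labels = rng.integers(0, k, size=n)
Z = 0.1 * rng.standard_normal((n, d))
Z[np.arange(n), labels] += 10.0  # one strong direction per sub-class
_, gap, align = spectral_properties(Z, labels, k)
print(gap, align)  # gap >> 1, alignment close to sqrt(2)
```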
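
For the Open Datasets row, "randomly flipping the original ones" refers to symmetric label noise. The paper does not publish its exact noise-injection code, so the sketch below is an assumption following one common convention: a fixed fraction of labels is flipped uniformly to a *different* class.

```python
import numpy as np

def symmetric_noise(labels: np.ndarray, noise_rate: float,
                    num_classes: int, seed: int = 0) -> np.ndarray:
    """Flip a `noise_rate` fraction of labels uniformly to another class."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(noise_rate * len(labels)),
                     replace=False)
    # An offset in [1, num_classes - 1] guarantees the new label differs.
    offsets = rng.integers(1, num_classes, size=len(idx))
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy

# e.g. 80% symmetric noise on CIFAR-10-style labels:
clean = np.random.default_rng(1).integers(0, 10, size=50_000)
noisy = symmetric_noise(clean, noise_rate=0.8, num_classes=10)
assert (noisy != clean).mean() == 0.8  # exactly 80% of labels flipped
```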
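
The hyperparameters in the Experiment Setup row translate directly into standard PyTorch optimizer configurations. The sketch below reproduces only the quoted settings; the two `nn.Linear` stand-ins (in place of the actual ResNet-32/SimCLR and Inception-ResNet-v2 networks) and the empty training loop are assumptions made so the snippet runs on its own.

```python
import torch
import torch.nn as nn

# Stand-ins for the real networks (ResNet-32 encoder, Inception-ResNet-v2).
encoder = nn.Linear(512, 128)
classifier = nn.Linear(1536, 50)

# SimCLR pre-training: Adam, lr 3e-4, weight decay 1e-6
# (batch size 128, 1000 epochs in the paper).
pretrain_opt = torch.optim.Adam(encoder.parameters(),
                                lr=3e-4, weight_decay=1e-6)

# WebVision training: SGD, lr 0.02 annealed by a factor of 0.01 at
# epochs 40 and 80, weight decay 1e-3, mini-batch size 32, 120 epochs.
finetune_opt = torch.optim.SGD(classifier.parameters(),
                               lr=0.02, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(finetune_opt,
                                                 milestones=[40, 80],
                                                 gamma=0.01)

for epoch in range(120):
    # ... one pass over the noisy training set in mini-batches of 32 ...
    scheduler.step()  # lr: 0.02 -> 2e-4 (epoch 40) -> 2e-6 (epoch 80)
```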