Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks
Authors: Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, through extensive experiments of supervised pre-training models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are different. |
| Researcher Affiliation | Collaboration | Carnegie Mellon University, Microsoft Research Asia, SUSTech, RIKEN AIP, The University of Tokyo, Mohamed bin Zayed University of AI |
| Pseudocode | No | The paper describes the proposed method (NMTune) through textual explanation and mathematical equations, but does not include a formal pseudocode block or algorithm (an illustrative, non-authoritative sketch follows the table). |
| Open Source Code | No | Additionally, all the generated noisy images and our pre-trained models based on these data are for research purpose only, and will be released per request. |
| Open Datasets | Yes | We use ImageNet-1K (IN-1K) (Russakovsky et al., 2015) in fully supervised pre-training and YFCC15M (Thomee et al., 2016) in CLIP pre-training, with ResNet-50 (He et al., 2016a). |
| Dataset Splits | No | For ID evaluation, we conduct training on the training set and test on the validation set of the downstream dataset. |
| Hardware Specification | Yes | All of our downstream experiments are conducted on a single NVIDIA V100 GPU. |
| Software Dependencies | No | The paper mentions software such as the AdamW optimizer and TIMM, but does not provide specific version numbers for these or other key software components. |
| Experiment Setup | Yes | We train the linear classifier for 30 epochs on each downstream dataset, using AdamW (Kingma & Ba, 2014) optimizer with a cosine scheduler. We do not use weight decay for linear probing and set the learning rate to 0.1 for all tasks. We set λ = 0.01 and use 2 layers MLP for all our experiments. (A hedged sketch of this setup follows the table.) |
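
The experiment-setup row quotes a concrete linear-probing recipe: a linear classifier trained for 30 epochs with AdamW, a cosine learning-rate schedule, no weight decay, and a learning rate of 0.1. The sketch below is a minimal, assumed reconstruction of that recipe in PyTorch; the feature dimension, batch size, and the toy stand-in data are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

feat_dim, num_classes, epochs = 2048, 1000, 30  # ResNet-50 feature dim used as an example

# Toy stand-in for pre-extracted frozen backbone features (assumption, not the paper's data).
features = torch.randn(512, feat_dim)
labels = torch.randint(0, num_classes, (512,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

linear_head = nn.Linear(feat_dim, num_classes)
# AdamW with lr = 0.1, no weight decay, and a cosine schedule over 30 epochs, as quoted above.
optimizer = torch.optim.AdamW(linear_head.parameters(), lr=0.1, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in train_loader:
        loss = criterion(linear_head(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```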
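
Since the paper gives NMTune only as equations (see the Pseudocode row), the following is a loose illustration, not the authors' objective, of how the quoted settings of a 2-layer MLP head and λ = 0.01 could be wired on top of frozen pre-trained features. The regularizer here is a placeholder mean-squared penalty; the paper's actual terms regularize properties of the feature space (e.g., its singular value spectrum), and `total_loss` and its arguments are hypothetical names.

```python
import torch
import torch.nn as nn

feat_dim, num_classes, lam = 2048, 1000, 0.01  # λ = 0.01 and a 2-layer MLP, as quoted above

# 2-layer MLP head on top of frozen pre-trained features (illustrative, not NMTune itself).
mlp = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
classifier = nn.Linear(feat_dim, num_classes)
criterion = nn.CrossEntropyLoss()

def total_loss(frozen_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    z = mlp(frozen_feats)
    task_loss = criterion(classifier(z), labels)
    # Placeholder feature regularizer, weighted by λ; NMTune's real losses are defined
    # by the equations in the paper and are not reproduced here.
    reg_loss = (z - frozen_feats).pow(2).mean()
    return task_loss + lam * reg_loss
```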