Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks

Authors: Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Specifically, through extensive experiments of supervised pre-training models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are different.
Researcher Affiliation | Collaboration | (1) Carnegie Mellon University, (2) Microsoft Research Asia, (3) SUSTech, (4) RIKEN AIP, (5) The University of Tokyo, (6) Mohamed bin Zayed University of AI
Pseudocode | No | The paper describes the proposed method (NMTune) through textual explanation and mathematical equations, but does not include a formal pseudocode block or algorithm. (A hedged structural sketch is given after the table.)
Open Source Code | No | Additionally, all the generated noisy images and our pre-trained models based on these data are for research purpose only, and will be released per request.
Open Datasets | Yes | We use ImageNet-1K (IN-1K) (Russakovsky et al., 2015) in fully supervised pre-training and YFCC15M (Thomee et al., 2016) in CLIP pre-training, with ResNet-50 (He et al., 2016a).
Dataset Splits | No | For ID evaluation, we conduct training on the training set and test on the validation set of the downstream dataset.
Hardware Specification | Yes | All of our experiments on downstream tasks are conducted on a single NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions software such as the AdamW optimizer and TIMM, but does not provide specific version numbers for these or other key software components.
Experiment Setup | Yes | We train the linear classifier for 30 epochs on each downstream dataset, using AdamW (Kingma & Ba, 2014) optimizer with a cosine scheduler. We do not use weight decay for linear probing and set the learning rate to 0.1 for all tasks. We set λ = 0.01 and use a 2-layer MLP for all our experiments. (A sketch of this training setup also follows the table.)
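
Since the paper itself contains no pseudocode for NMTune, the following is a minimal structural sketch of NMTune-style tuning, not the authors' implementation. It is grounded only in the quoted setup above (a 2-layer MLP applied to pre-trained features and a regularization weight λ = 0.01); `NMTuneHead`, the hidden width, and the placeholder `reg_loss` term are illustrative assumptions, and the actual regularization objectives are those defined by the equations in the paper.

```python
# Hypothetical structural sketch of NMTune-style tuning; NOT the authors' code.
# Grounded only in the quoted setup: a 2-layer MLP on pre-trained features and
# a regularization weight lambda = 0.01. The paper's concrete regularization
# terms are stood in for by the placeholder `reg_loss` below.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NMTuneHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, hidden_dim: int = 2048):
        super().__init__()
        # 2-layer MLP that transforms the (possibly noise-affected) pre-trained features.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor):
        z = self.mlp(feats)          # transformed features
        logits = self.classifier(z)  # downstream prediction
        return z, logits

def nmtune_loss(feats, z, logits, labels, lam: float = 0.01):
    """Task loss plus a weighted feature regularizer (placeholder)."""
    task_loss = F.cross_entropy(logits, labels)
    # Placeholder consistency term, NOT the paper's exact objective.
    reg_loss = F.mse_loss(z, feats)
    return task_loss + lam * reg_loss
```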
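
The quoted downstream setup in the last row (30 epochs, AdamW with a cosine scheduler, learning rate 0.1, no weight decay for linear probing) can be read as the hedged training sketch below. Only the hyperparameters come from the quote; the `linear_probe` function, the frozen-backbone interface, and the data loader are assumptions for illustration.

```python
# Hedged sketch of the quoted linear-probing setup; backbone and data are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(backbone: nn.Module, train_loader, feat_dim: int, num_classes: int,
                 epochs: int = 30, lr: float = 0.1, device: str = "cuda"):
    backbone.eval().to(device)                     # frozen pre-trained encoder
    head = nn.Linear(feat_dim, num_classes).to(device)
    # Quoted setup: AdamW, cosine schedule, lr 0.1, no weight decay for linear probing.
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=0.0)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)           # extract frozen features
            loss = F.cross_entropy(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return head
```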