Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks

Authors: Hao Chen, Jindong Wang, Ankit Shah, Ran Tao, Hongxin Wei, Xing Xie, Masashi Sugiyama, Bhiksha Raj

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Specifically, through extensive experiments of supervised pre-training models on synthetic noisy ImageNet-1K and YFCC15M datasets, we demonstrate that while slight noise in pre-training can benefit in-domain (ID) performance, where the training and testing data share the same distribution, it always deteriorates out-of-domain (OOD) performance, where training and testing distributions are different.
Researcher Affiliation | Collaboration | (1) Carnegie Mellon University, (2) Microsoft Research Asia, (3) SUSTech, (4) RIKEN AIP, (5) The University of Tokyo, (6) Mohamed bin Zayed University of AI
Pseudocode | No | The paper describes the proposed method (NMTune) through textual explanation and mathematical equations, but does not include a formal pseudocode block or algorithm. (A hedged structural sketch is given after the table.)
Open Source Code | No | Additionally, all the generated noisy images and our pre-trained models based on these data are for research purpose only, and will be released per request.
Open Datasets | Yes | We use ImageNet-1K (IN-1K) (Russakovsky et al., 2015) in fully supervised pre-training and YFCC15M (Thomee et al., 2016) in CLIP pre-training, with ResNet-50 (He et al., 2016a).
Dataset Splits | No | For ID evaluation, we conduct training on the training set and test on the validation set of the downstream dataset.
Hardware Specification | Yes | All of our experiments on downstream tasks are conducted on a single NVIDIA V100 GPU.
Software Dependencies | No | The paper mentions software such as the AdamW optimizer and TIMM, but does not provide specific version numbers for these or other key software components.
Experiment Setup | Yes | We train the linear classifier for 30 epochs on each downstream dataset, using AdamW (Kingma & Ba, 2014) optimizer with a cosine scheduler. We do not use weight decay for linear probing and set the learning rate to 0.1 for all tasks. We set λ = 0.01 and use a 2-layer MLP for all our experiments. (A sketch of this training setup also follows the table.)
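
Since the paper itself contains no pseudocode for NMTune, the following is a minimal structural sketch of NMTune-style tuning, not the authors' implementation. It is grounded only in the quoted setup above (a 2-layer MLP applied to pre-trained features and a regularization weight λ = 0.01); `NMTuneHead`, the hidden width, and the placeholder `reg_loss` term are illustrative assumptions, and the actual regularization objectives are those defined by the equations in the paper.

```python
# Hypothetical structural sketch of NMTune-style tuning; NOT the authors' code.
# Grounded only in the quoted setup: a 2-layer MLP on pre-trained features and
# a regularization weight lambda = 0.01. The paper's concrete regularization
# terms are stood in for by the placeholder `reg_loss` below.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NMTuneHead(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int, hidden_dim: int = 2048):
        super().__init__()
        # 2-layer MLP that transforms the (possibly noise-affected) pre-trained features.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, feat_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats: torch.Tensor):
        z = self.mlp(feats)          # transformed features
        logits = self.classifier(z)  # downstream prediction
        return z, logits

def nmtune_loss(feats, z, logits, labels, lam: float = 0.01):
    """Task loss plus a weighted feature regularizer (placeholder)."""
    task_loss = F.cross_entropy(logits, labels)
    # Placeholder consistency term, NOT the paper's exact objective.
    reg_loss = F.mse_loss(z, feats)
    return task_loss + lam * reg_loss
```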
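
The quoted downstream setup in the last row (30 epochs, AdamW with a cosine scheduler, learning rate 0.1, no weight decay for linear probing) can be read as the hedged training sketch below. Only the hyperparameters come from the quote; the `linear_probe` function, the frozen-backbone interface, and the data loader are assumptions for illustration.

```python
# Hedged sketch of the quoted linear-probing setup; backbone and data are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe(backbone: nn.Module, train_loader, feat_dim: int, num_classes: int,
                 epochs: int = 30, lr: float = 0.1, device: str = "cuda"):
    backbone.eval().to(device)                     # frozen pre-trained encoder
    head = nn.Linear(feat_dim, num_classes).to(device)
    # Quoted setup: AdamW, cosine schedule, lr 0.1, no weight decay for linear probing.
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=0.0)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = backbone(images)           # extract frozen features
            loss = F.cross_entropy(head(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return head
```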