NLIP: Noise-Robust Language-Image Pre-training

Authors: Runhui Huang, Yanxin Long, Jianhua Han, Hang Xu, Xiwen Liang, Chunjing Xu, Xiaodan Liang

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show the significant performance improvements of our NLIP using only 26M data over existing pre-trained models (e.g., CLIP, BLIP) on 12 zero-shot classification datasets (e.g., +8.6% over CLIP on average accuracy), MSCOCO image captioning (e.g., +1.9 over BLIP trained with 129M data on CIDEr) and zero-shot image-text retrieval tasks.
Researcher Affiliation | Collaboration | Runhui Huang1, Yanxin Long1, Jianhua Han2, Hang Xu2, Xiwen Liang1, Chunjing Xu2, Xiaodan Liang1* (1 Shenzhen campus of Sun Yat-sen University; 2 Huawei Noah's Ark Lab)
Pseudocode | No | The paper describes the model architecture and training procedure in text and diagrams (Figure 2), but does not provide pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We pre-train NLIP on a 26M subset of YFCC100M named YFCC26M, and the filtering rules follow FILIP (Yao et al. 2021)... we fine-tune NLIP on COCO (Lin et al. 2014)'s Karpathy train split (Karpathy and Fei-Fei 2015)...
Dataset Splits | Yes | We pre-train NLIP on a 26M subset of YFCC100M named YFCC26M... We evaluate our NLIP on the zero-shot image classification and linear probing task on 12 downstream classification datasets... We fine-tune NLIP on COCO (Lin et al. 2014)'s Karpathy train split (Karpathy and Fei-Fei 2015)...
Hardware Specification | Yes | We pre-train our NLIP on 32 Nvidia V100 for 50 epochs with 6144 batch size.
Software Dependencies | No | We gratefully acknowledge the support of MindSpore, CANN (Compute Architecture for Neural Networks) and Ascend AI Processor used for this research.
Experiment Setup | Yes | We pre-train our NLIP on 32 Nvidia V100 for 50 epochs with 6144 batch size. LAMB (You et al. 2020) optimizer is adopted with a weight decay of 0.05. The base learning rate is set to 0.003 and the scaling rule keeps the same with Yao et al. (2021). The learning rate is linearly warmed up in the first five epochs and then gets decayed by the cosine learning rate schedule (Loshchilov and Hutter 2016). We pre-train NLIP on a 26M subset of YFCC100M named YFCC26M... The training epochs E_e, E_t and E_f in different stages are set as 5, 45 and 20, respectively. The weighting factors α and β are both 1 and λ in L_NITC is 0.5. (See the configuration sketch after the table.)
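To make the quoted Experiment Setup row easier to scan, the sketch below collects the reported hyperparameters into a single Python configuration and reproduces the stated learning-rate schedule (linear warmup for the first 5 epochs, then cosine decay over the remaining epochs). This is a minimal sketch under those stated values only; the names PRETRAIN_CONFIG and learning_rate are illustrative assumptions, not code from the paper or any official NLIP release.

```python
import math

# Hypothetical configuration assembled from the setup quoted above.
# Field names are illustrative; values are the ones reported in the paper.
PRETRAIN_CONFIG = {
    "gpus": 32,                 # Nvidia V100
    "epochs": 50,
    "batch_size": 6144,
    "optimizer": "LAMB",        # You et al. 2020; recorded as a string only
    "weight_decay": 0.05,
    "base_lr": 0.003,           # scaling rule follows Yao et al. (2021)
    "warmup_epochs": 5,         # linear warmup
    "stage_epochs": {"E_e": 5, "E_t": 45, "E_f": 20},
    "loss_weights": {"alpha": 1.0, "beta": 1.0, "lambda_nitc": 0.5},
}

def learning_rate(epoch: float, base_lr: float = 0.003,
                  warmup_epochs: int = 5, total_epochs: int = 50) -> float:
    """Linear warmup followed by cosine decay, as described in the setup."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    # Print the schedule at a few checkpoints: start, mid-warmup, end of
    # warmup (peak), mid-training, and the final epoch.
    for e in (0, 2, 5, 25, 50):
        print(f"epoch {e:2d}: lr = {learning_rate(e):.5f}")
```

The optimizer is kept as a plain string because LAMB is not part of torch.optim; any concrete run would need a third-party LAMB implementation, which the paper does not name.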