FILIP: Fine-grained Interactive Language-Image Pre-Training

Authors: Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section heading '4 EXPERIMENTS' and the statement 'Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks including zero-shot image classification and image-text retrieval.' |
| Researcher Affiliation | Collaboration | 1 Huawei Noah's Ark Lab, 2 Hong Kong University of Science and Technology, 3 Sun Yat-sen University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using 'the LAMB optimizer implemented by the cybertronai's open-source repository (https://github.com/cybertronai/pytorch-lamb)' but does not state that the code for FILIP itself is open source or provide a link to it. (A usage sketch for this optimizer follows the table.) |
| Open Datasets | Yes | 'We also use 3 public datasets, including Conceptual Captions 3M (CC3M) (Sharma et al., 2018), Conceptual 12M (CC12M) (Changpinyo et al., 2021) and Yahoo Flickr Creative Commons 100M (YFCC100M) (Thomee et al., 2016).' |
| Dataset Splits | No | The paper describes training and test sets for evaluation but does not explicitly describe a dedicated validation split for hyperparameter tuning or early stopping. |
| Hardware Specification | Yes | 'The training is mainly conducted on Nvidia V100 GPUs and Ascend cards.' |
| Software Dependencies | No | The paper mentions software such as the LAMB optimizer, scikit-learn, and a PyTorch-based codebase, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | 'Table 8 summarizes the common hyperparameters and Table 9 shows the model- and dataset-specific hyperparameters for FILIP pre-training. Table 10 shows the hyperparameters for image-text retrieval fine-tuning. Table 13 shows the hyperparameters used in linear probe on ImageNet.' |
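
To make the optimizer dependency from the Open Source Code row concrete, the sketch below shows one plausible way to use the cited cybertronai pytorch-lamb repository. This is a minimal sketch, not the FILIP authors' code: the import path and constructor arguments are assumed from that repository, and the hyperparameter values are placeholders rather than the settings reported in the paper's Tables 8 and 9.

```python
# Minimal sketch (not from the FILIP paper): using the LAMB optimizer from
# https://github.com/cybertronai/pytorch-lamb, the repository the paper cites.
# Assumed install: pip install git+https://github.com/cybertronai/pytorch-lamb
import torch
from pytorch_lamb import Lamb  # assumed import path from that repository

# Placeholder model; FILIP's dual-encoder architecture is not reproduced here.
model = torch.nn.Linear(512, 512)

# Hyperparameter values are illustrative placeholders, NOT the paper's settings.
optimizer = Lamb(model.parameters(), lr=1e-3, weight_decay=0.01, betas=(0.9, 0.999))

# One standard training step with a dummy loss.
loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Reproducing FILIP's training would additionally require the hyperparameters from its appendix tables and the model code, which the paper does not release.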