Contrastive Language-Image Pre-Training with Knowledge Graphs

Authors: Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines.
Researcher Affiliation | Academia | Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang, Department of Automation, BNRist, Tsinghua University, Beijing, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The code will be released when the paper is accepted.
Open Datasets | Yes | Three knowledge graph datasets are adopted in the pre-training process. VisualSem [2] is a high-quality multi-modal knowledge graph dataset... Visual Genome [24] is a knowledge-based scene graph dataset... ConceptNet [46] is a knowledge graph... Besides the three knowledge graph datasets, we also train our model on two widely adopted image-text datasets... We practically add COCO Caption [8] and CC3M [42] to the training set.
Dataset Splits | Yes | We conduct experiments on various downstream tasks, including multi-modal tasks like text and image retrieval and visual question answering, and uni-modal tasks like image classification and natural language understanding. We report results on the Flickr30K retrieval task and the VQA task with ViT-B/32 as the image encoder.
Hardware Specification | No | The paper states: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See supplemental material.' This information is not provided in the main body of the paper.
Software Dependencies | No | The paper mentions a '12-layer Transformer model', 'ViT-L/14', and a 'BPE tokenizer', but does not specify software dependencies with version numbers such as PyTorch, TensorFlow, or CUDA versions.
Experiment Setup | Yes | In all the experiments, we use the same model structure as CLIP [38]. A 12-layer Transformer model with 512 width is adopted for the text encoder, and ViT-L/14 is adopted for the image encoder. For the text and image encoders, we use the pre-trained weights of the original CLIP as initialization. For the multi-modal encoder, we consider a 4-layer Transformer model with 1024 width. The drop path rate is set to 0.1 during training. ...We train Knowledge-CLIP with an initial learning rate of 1e-5 for the image and text encoders and 1e-3 for the multi-modal encoder. A cosine learning rate schedule with linear warmup is used. Weight decay and gradient clipping are also adopted. (See the configuration sketch below the table.)
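The hyperparameters quoted in the Experiment Setup row translate directly into an optimizer and schedule configuration. The sketch below is a minimal, hedged PyTorch illustration: only the per-module learning rates (1e-5 / 1e-3), the cosine-with-warmup schedule, and the use of weight decay and gradient clipping come from the quoted excerpt. The stand-in encoder modules, the AdamW choice, the step counts, the weight-decay value, and the clip norm are assumptions for illustration, and the paper's drop-path rate of 0.1 (which would live inside the ViT backbone) is omitted.

```python
import math

import torch
import torch.nn as nn

# Stand-in modules (NOT the paper's implementation): placeholders matching the
# reported shapes of the text encoder (12 layers, width 512) and the
# multi-modal encoder (4 layers, width 1024); the image encoder (ViT-L/14)
# is replaced by a simple linear stub to keep the sketch self-contained.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=12,
)
image_encoder = nn.Linear(3 * 224 * 224, 768)  # placeholder for ViT-L/14
multimodal_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=4,
)

# Per-module learning rates from the excerpt: 1e-5 for the CLIP-initialized
# image/text encoders, 1e-3 for the randomly initialized multi-modal encoder.
optimizer = torch.optim.AdamW(
    [
        {"params": text_encoder.parameters(), "lr": 1e-5},
        {"params": image_encoder.parameters(), "lr": 1e-5},
        {"params": multimodal_encoder.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.2,  # assumed value; the paper only states weight decay is used
)

# Cosine learning-rate schedule with linear warmup (step counts are assumed).
total_steps, warmup_steps = 100_000, 2_000


def lr_lambda(step: int) -> float:
    """Scale factor applied to each parameter group's base learning rate."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, gradient clipping (clip norm assumed) would sit
# between the backward pass and the optimizer step:
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(
#       [p for g in optimizer.param_groups for p in g["params"]], max_norm=1.0
#   )
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

The separate parameter groups reflect the asymmetry described in the excerpt: the image and text encoders start from pre-trained CLIP weights and are fine-tuned gently at 1e-5, while the newly added multi-modal encoder trains from scratch at 1e-3.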