Contrastive Language-Image Pre-Training with Knowledge Graphs
Authors: Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines. |
| Researcher Affiliation | Academia | Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang; Department of Automation, BNRist, Tsinghua University, Beijing, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code will be released when the paper is accepted. |
| Open Datasets | Yes | Three knowledge graph datasets are adopted in the pre-training process. VisualSem [2] is a high-quality multi-modal knowledge graph dataset... Visual Genome [24] is a knowledge-based scene graph dataset... ConceptNet [46] is a knowledge graph... Besides the three knowledge graph datasets, we also train our model on two widely adopted image-text datasets... We practically add COCO Caption [8] and CC3M [42] to the training set |
| Dataset Splits | Yes | We conduct experiments on various downstream tasks, including multi-modal tasks like text and image retrieval, visual question answering, and uni-modal tasks like image classification and natural language understanding. We report results on the Flickr30K retrieval task and the VQA task with ViT-B/32 as the image encoder. |
| Hardware Specification | No | The paper states: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See supplemental material.' This information is not provided in the main body of the paper. |
| Software Dependencies | No | The paper mentions using a '12-layer Transformer model', 'ViT-L/14', and a 'BPE tokenizer' but does not specify software dependencies with version numbers such as PyTorch, TensorFlow, or CUDA versions. |
| Experiment Setup | Yes | In all the experiments, we use the same model structure as CLIP [38]. A 12-layer Transformer model with a width of 512 is adopted for the text encoder, and ViT-L/14 is adopted for the image encoder. For the text and image encoders, we use the pre-trained weights of the original CLIP as the initialization. For the multi-modal encoder, we consider a 4-layer Transformer model with a width of 1024. The drop path rate is set to 0.1 during training. ...We train Knowledge-CLIP with an initial learning rate of 1e-5 for the image and text encoders, and 1e-3 for the multi-modal encoder. A cosine learning rate schedule with linear warmup is used during training. Weight decay and gradient clipping are also adopted. |
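
The hyper-parameters quoted in the Experiment Setup row translate naturally into a two-group optimizer with a linear-warmup-plus-cosine schedule. The PyTorch sketch below is illustrative only: the optimizer choice (AdamW), the weight-decay value, the warmup length, the gradient-clipping norm, and the module names (`image_encoder`, `text_encoder`, `multimodal_encoder`) are assumptions not stated in the paper, and the modules themselves are placeholders for the actual CLIP encoders.

```python
# Minimal sketch of the reported training configuration, assuming PyTorch and AdamW.
import torch
import torch.nn as nn


class KnowledgeCLIPStub(nn.Module):
    """Placeholder module tree mirroring the reported architecture sizes."""

    def __init__(self):
        super().__init__()
        # Text encoder: 12-layer Transformer, width 512 (as in CLIP).
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
            num_layers=12,
        )
        # Image encoder: stand-in for ViT-L/14 (initialized from CLIP weights in the paper).
        self.image_encoder = nn.Linear(3 * 224 * 224, 768)
        # Multi-modal encoder: 4-layer Transformer, width 1024.
        # (The paper uses a drop path rate of 0.1, which is omitted here.)
        self.multimodal_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
            num_layers=4,
        )


model = KnowledgeCLIPStub()

# Separate learning rates: 1e-5 for the pre-trained image/text encoders,
# 1e-3 for the newly added multi-modal encoder; weight decay as reported.
optimizer = torch.optim.AdamW(
    [
        {"params": model.image_encoder.parameters(), "lr": 1e-5},
        {"params": model.text_encoder.parameters(), "lr": 1e-5},
        {"params": model.multimodal_encoder.parameters(), "lr": 1e-3},
    ],
    weight_decay=0.05,  # assumed value; the paper only states that weight decay is used
)

# Cosine schedule with linear warmup (warmup and total step counts are assumptions).
warmup_steps, total_steps = 1_000, 100_000
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(
            optimizer, start_factor=0.01, total_iters=warmup_steps
        ),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer, T_max=total_steps - warmup_steps
        ),
    ],
    milestones=[warmup_steps],
)

# Inside the training loop, gradient clipping would be applied before optimizer.step():
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # max_norm assumed
```

Keeping the pre-trained image and text encoders in their own parameter groups is what allows them to be fine-tuned at 1e-5 while the freshly initialized multi-modal encoder trains at the higher rate of 1e-3, as described in the quoted setup.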