Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Contrastive Language-Image Pre-Training with Knowledge Graphs

Authors: Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines.
Researcher Affiliation | Academia | Xuran Pan, Tianzhu Ye, Dongchen Han, Shiji Song, Gao Huang: Department of Automation, BNRist, Tsinghua University, Beijing, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The code will be released when the paper is accepted.
Open Datasets | Yes | Three knowledge graph datasets are adopted in the pre-training process. VisualSem [2] is a high-quality multi-modal knowledge graph dataset... Visual Genome [24] is a knowledge-based scene graph dataset... ConceptNet [46] is a knowledge graph... Besides the three knowledge graph datasets, we also train our model on two widely adopted image-text datasets... We practically add COCO Caption [8] and CC3M [42] to the training set.
Dataset Splits | Yes | We conduct experiments on various downstream tasks, including multi-modal tasks like text and image retrieval and visual question answering, and uni-modal tasks like image classification and natural language understanding. We report results on the Flickr30K retrieval task and the VQA task with ViT-B/32 as the image encoder.
Hardware Specification | No | The paper states: 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See supplemental material.' This information is not provided in the main body of the paper.
Software Dependencies | No | The paper mentions using a '12-layer Transformer model', 'ViT-L/14', and a 'BPE tokenizer', but does not specify software dependencies with version numbers, such as PyTorch, TensorFlow, or CUDA versions.
Experiment Setup | Yes | In all the experiments, we use the same model structure as CLIP [38]. A 12-layer Transformer model with 512 width is adopted for the text encoder, and ViT-L/14 is adopted for the image encoder. For the text and image encoders, we use the pre-trained weights of the original CLIP as initialization. For the multi-modal encoder, we consider a 4-layer Transformer model with 1024 width. The drop path rate is set to 0.1 during training. ...We train Knowledge-CLIP with an initial learning rate of 1e-5 for the image and text encoders, and 1e-3 for the multi-modal encoder. A cosine learning rate schedule with linear warmup is used in training. Weight decay and gradient clipping are also adopted.
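The training schedule quoted above (linear warmup followed by cosine decay) can be sketched as a plain function. The warmup length and total step count below are illustrative assumptions; the paper does not report these values, only the schedule shape and the base learning rates.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Cosine learning-rate schedule with linear warmup.

    A minimal sketch of the schedule described in the paper's training
    setup. `total_steps` and `warmup_steps` are assumptions for
    illustration; only the schedule shape comes from the paper.
    """
    if step < warmup_steps:
        # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Per the paper, the base learning rate would be 1e-5 for the image and
# text encoders and 1e-3 for the multi-modal encoder.
encoder_lr = lr_at_step(step=500, total_steps=1000, warmup_steps=100, base_lr=1e-5)
```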