Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Authors: Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on VG-Attribution and VG-Relation datasets, with 12.5% and 4.1% ahead of the multi-modal SOTA model respectively.
Researcher Affiliation | Collaboration | (1) School of Software Technology, Zhejiang University; (2) Fuxi AI Lab, NetEase Inc.; (3) College of Computer Science and Technology, Zhejiang University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical equations for loss functions and encoding.
Open Source Code | Yes | Our code is available at https://github.com/zjukg/Structure-CLIP.
Open Datasets | Yes | We adopt the widely-used cross-modal text-image retrieval dataset, MSCOCO (Lin et al. 2014)... Two novel datasets (Yuksekgonul et al. 2022) are used to evaluate the structured representation performance of different models, where each test case consists of an image with matched captions and swapped mismatched captions. (A sketch of this matched-vs-swapped evaluation is given below the table.)
Dataset Splits | Yes | Consistent with prior work (Li et al. 2022), we utilize the Karpathy (Karpathy and Fei-Fei 2017) split for training and evaluation. In our experiment, pre-training is conducted by filtering approximately 100k image-text pairs that involve multiple objects, attributes, and relationships. Subsequently, the models are evaluated on test splits, encompassing 5k images.
Hardware Specification | Yes | All of our experiments are performed on a single NVIDIA A100 GPU with the Pytorch framework.
Software Dependencies | No | The paper mentions 'Pytorch framework' and 'BERT-base' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | During the training stage, we initialize the model with a pre-trained CLIP model and train it on our dataset for 10 epochs using a batch size of 128. We use a mini-batch AdamW optimizer with a weight decay of 0.1. The learning rate is initialized as 2e-6. The knowledge weight λ is 0.2. (A configuration sketch is given below the table.)
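
The Open Datasets row quotes the VG-Relation / VG-Attribution protocol (Yuksekgonul et al. 2022): each test case pairs an image with a matched caption and a word-swapped mismatched caption, and a model is credited when it ranks the matched caption higher. Below is a minimal sketch of that scoring loop, assuming the HuggingFace transformers CLIP API; vanilla CLIP stands in for Structure-CLIP, and the listed test case is a hypothetical placeholder.

```python
# Minimal matched-vs-swapped caption evaluation, in the style of the
# VG-Relation / VG-Attribution benchmarks. Vanilla CLIP is used as a
# stand-in model; the test cases are hypothetical placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def pair_is_correct(image: Image.Image, matched: str, swapped: str) -> bool:
    """Return True if the model scores the matched caption above the swapped one."""
    inputs = processor(text=[matched, swapped], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, 2)
    return bool(logits[0, 0] > logits[0, 1])

# Hypothetical test cases: (image_path, matched_caption, swapped_caption).
cases = [("example.jpg", "a man feeding a horse", "a horse feeding a man")]
correct = sum(
    pair_is_correct(Image.open(path).convert("RGB"), matched, swapped)
    for path, matched, swapped in cases
)
print(f"accuracy: {correct / len(cases):.3f}")
```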
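
The Experiment Setup row lists the reported fine-tuning hyperparameters. The following is a minimal configuration sketch, assuming the HuggingFace CLIPModel and a PyTorch AdamW optimizer; the dataloader and the knowledge-enhanced loss term are placeholders, and combining the CLIP contrastive loss with a λ-weighted knowledge term is an assumption inferred from the reported knowledge weight, not the paper's exact objective.

```python
# Sketch of the reported fine-tuning configuration: pre-trained CLIP init,
# AdamW (lr 2e-6, weight decay 0.1), batch size 128, 10 epochs, λ = 0.2.
import torch
from transformers import CLIPModel

# Initialize from a pre-trained CLIP checkpoint, as in the reported setup.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6, weight_decay=0.1)
knowledge_weight = 0.2  # λ, the knowledge weight from the experiment setup

def knowledge_loss(batch) -> torch.Tensor:
    # Placeholder for the paper's knowledge-enhanced (scene-graph) loss term.
    return torch.tensor(0.0)

# Placeholder dataloader: should yield batches of 128 tokenized image-text
# pairs (input_ids, attention_mask, pixel_values) from the filtered MSCOCO set.
train_loader: list = []

model.train()
for epoch in range(10):
    for batch in train_loader:
        # return_loss=True makes CLIPModel compute its symmetric contrastive loss.
        outputs = model(**batch, return_loss=True)
        loss = outputs.loss + knowledge_weight * knowledge_loss(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```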