Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations
Authors: Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, Wen Zhang
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, surpassing the previous multi-modal SOTA model by 12.5% and 4.1%, respectively. |
| Researcher Affiliation | Collaboration | 1School of Software Technology, Zhejiang University 2Fuxi AI Lab, Netease Inc. 3College of Computer Science and Technology, Zhejiang University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical equations for loss functions and encoding. |
| Open Source Code | Yes | Our code is available at https://github.com/zjukg/Structure-CLIP. |
| Open Datasets | Yes | We adopt the widely-used cross-modal text-image retrieval dataset, MSCOCO (Lin et al. 2014)... Two novel datasets (Yuksekgonul et al. 2022) are used to evaluate the structured representation performance of different models, where each test case consists of an image with matched captions and swapped mismatched captions (see the evaluation sketch after the table). |
| Dataset Splits | Yes | Consistent with prior work (Li et al. 2022), we utilize the Karpathy (Karpathy and Fei-Fei 2017) split for training and evaluation. In our experiment, pre-training is conducted by filtering approximately 100k image-text pairs that involve multiple objects, attributes, and relationships. Subsequently, the models are evaluated on test splits encompassing 5k images (see the split-loading sketch after the table). |
| Hardware Specification | Yes | All of our experiments are performed on a single NVIDIA A100 GPU with the PyTorch framework. |
| Software Dependencies | No | The paper mentions the PyTorch framework and BERT-base but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | During the training stage, we initialize the model with a pre-trained CLIP model and train it on our dataset for 10 epochs using a batch size of 128. We use a mini-batch AdamW optimizer with a weight decay of 0.1. The learning rate is initialized as 2e-6. The knowledge weight λ is 0.2 (see the training-configuration sketch after the table). |
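
The matched-versus-swapped evaluation described in the Open Datasets row can be scored with a plain CLIP-style similarity comparison. The sketch below is a minimal illustration, not the paper's evaluation code: the checkpoint name, the `prefers_matched` helper, and the use of Hugging Face's `CLIPModel`/`CLIPProcessor` are all assumptions.

```python
# Minimal sketch of matched-vs-swapped caption scoring (assumed setup;
# not the paper's evaluation code).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prefers_matched(image, matched_caption, swapped_caption):
    """Return True if the model scores the matched caption above the swapped one."""
    inputs = processor(text=[matched_caption, swapped_caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 2): one image, two captions
    return bool(logits[0, 0] > logits[0, 1])
```

A VG-Relation test case compares, for example, "the horse is eating the grass" against its swapped counterpart "the grass is eating the horse"; accuracy is the fraction of test cases where the matched caption wins.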
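
Loading the Karpathy split referenced in the Dataset Splits row follows a well-known convention. The sketch below assumes the widely distributed `dataset_coco.json` layout; the scene-graph-based filter that yields the ~100k pre-training pairs is left as a placeholder, since the paper's excerpt does not spell out its implementation.

```python
# Minimal sketch of the Karpathy split, assuming the standard
# dataset_coco.json layout; the file path is an assumption.
import json

with open("dataset_coco.json") as f:
    images = json.load(f)["images"]

# By convention, the "restval" portion is folded into training.
train = [im for im in images if im["split"] in ("train", "restval")]
test = [im for im in images if im["split"] == "test"]  # the 5k-image test split

def involves_multiple_elements(pair):
    # Placeholder for the paper's filter keeping ~100k image-text pairs
    # that involve multiple objects, attributes, and relationships.
    raise NotImplementedError
```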
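
The hyperparameters in the Experiment Setup row translate directly into a standard PyTorch configuration. The sketch below is a hedged illustration: the checkpoint name and the exact way the knowledge weight λ combines the two loss terms are assumptions, with the precise loss definitions given by the paper's equations.

```python
# Minimal sketch of the reported training configuration (assumptions noted).
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6, weight_decay=0.1)

EPOCHS = 10
BATCH_SIZE = 128
KNOWLEDGE_WEIGHT = 0.2  # lambda reported in the paper

def combined_loss(contrastive_loss, knowledge_loss):
    # Assumed weighted sum; the exact form of each term is defined by the
    # paper's loss equations.
    return contrastive_loss + KNOWLEDGE_WEIGHT * knowledge_loss
```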