Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search
Authors: Meiyu Liang, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, Zhe Xue
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods. |
| Researcher Affiliation | Academia | Meiyu Liang, Junping Du*, Zhengyang Liang, Yongwang Xing, Wei Huang, Zhe Xue Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China {meiyu1210, junpingd, 1ce, 1416642324, xuezhe}@bupt.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | MSCOCO (Lin et al. 2014): This dataset contains 123,287 images in total. Each image is described by five annotated sentences, and the annotations are classified into 80 categories. We randomly select 5,000 image-text pairs as the query set and use the remaining pairs as the retrieval set. Flickr30k (Young et al. 2014): It contains 31,783 images from the Flickr website, and each image is described by five different sentences. Following the settings in (Tu et al. 2022), this dataset is split into 29,783 training images, 1,000 validation images, and 1,000 testing images. |
| Dataset Splits | Yes | Flickr30k (Young et al. 2014): It contains 31,783 images from the Flickr website, and each image is described by five different sentences. Following the settings in (Tu et al. 2022), this dataset is split into 29,783 training images, 1,000 validation images, and 1,000 testing images. (See the split sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as 'Faster-RCNN', the 'Bottom-Up and Top-Down (BUTD) attention model', and the 'BERT model', but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The unified hash representation length is set to 16, 32, 64, and 128 bits respectively. For each image, the Faster-RCNN detector provided by the Bottom-Up and Top-Down (BUTD) attention model is used to extract R (R = 36) region proposals and to obtain a 2,048-dimensional feature for each region; the BUTD model is pre-trained on the ImageNet and Visual Genome datasets. For each input text, the base version of pre-trained BERT is used to obtain the original word embeddings with dimension 768. The weight α is 0.9. The model is trained with batch size 256. The momentum-encoder queue length K is 8,192 for Flickr30k and 65,536 for MSCOCO, the momentum-encoder update hyperparameter m is 0.99, and the temperature coefficient τ is 0.07. The highest MAP in the retrieval task is achieved at lr = 0.0005, so lr = 0.0005 is used in the experiments. Convergence of the cross-modal search algorithm stabilizes at approximately epoch 25. (See the configuration sketch after the table.) |
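
The dataset protocol in the Open Datasets and Dataset Splits rows can be expressed compactly. Below is a minimal Python sketch, assuming image-text pairs are held in ordinary lists; the function names and the fixed ordering assumed for Flickr30k are illustrative and not taken from the authors' code.

```python
import random
from typing import List, Sequence, Tuple

def split_mscoco(pairs: Sequence, num_query: int = 5000, seed: int = 0) -> Tuple[List, List]:
    """MSCOCO protocol: randomly hold out 5,000 image-text pairs as the query set;
    the remaining pairs form the retrieval (database) set."""
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    query = [pairs[i] for i in indices[:num_query]]
    retrieval = [pairs[i] for i in indices[num_query:]]
    return query, retrieval

def split_flickr30k(pairs: Sequence,
                    n_train: int = 29783, n_val: int = 1000, n_test: int = 1000):
    """Flickr30k protocol following (Tu et al. 2022):
    29,783 training / 1,000 validation / 1,000 test images."""
    assert len(pairs) == n_train + n_val + n_test  # 31,783 images in total
    return (list(pairs[:n_train]),
            list(pairs[n_train:n_train + n_val]),
            list(pairs[n_train + n_val:]))
```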
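The Experiment Setup row lists hyperparameters, but no source code is released, so the following is a hedged sketch: it collects the reported values into a configuration object and shows a MoCo-style momentum-encoder update consistent with the K/m/τ hyperparameters. The names `CMGCHConfig` and `momentum_update` are hypothetical; the paper does not specify its implementation.

```python
from dataclasses import dataclass, field
from typing import List

import torch

@dataclass
class CMGCHConfig:
    """Reported hyperparameters (values quoted in the table); field names are illustrative."""
    hash_bits: List[int] = field(default_factory=lambda: [16, 32, 64, 128])
    num_regions: int = 36        # R region proposals per image (Faster-RCNN / BUTD)
    region_dim: int = 2048       # feature dimension per region
    text_dim: int = 768          # BERT-base word-embedding dimension
    alpha: float = 0.9           # weight α
    batch_size: int = 256
    queue_size: int = 8192       # K; set to 65,536 for MSCOCO
    momentum: float = 0.99       # m, momentum-encoder update rate
    temperature: float = 0.07    # τ, contrastive temperature
    lr: float = 5e-4             # best MAP reported at lr = 0.0005
    epochs: int = 25             # convergence stabilizes around epoch 25

@torch.no_grad()
def momentum_update(encoder_q: torch.nn.Module,
                    encoder_k: torch.nn.Module,
                    m: float = 0.99) -> None:
    """MoCo-style update of the key (momentum) encoder, as implied by the queue/momentum
    hyperparameters above; a sketch under that assumption, not the authors' implementation."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```

For MSCOCO, the same configuration would be instantiated with `queue_size=65536`; everything else in the quoted setup is shared across datasets.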