Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search
Authors: Meiyu Liang, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, Zhe Xue
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods. |
| Researcher Affiliation | Academia | Meiyu Liang, Junping Du*, Zhengyang Liang, Yongwang Xing, Wei Huang, Zhe Xue Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China {meiyu1210, junpingd, 1ce, 1416642324, xuezhe}@bupt.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | MSCOCO (Lin et al. 2014): This dataset contains 123,287 images in total. Each image is described by five annotated sentences, and the annotations are classified into 80 categories. We randomly select 5,000 image-text pairs as the query set and use the remaining pairs as the retrieval set. Flickr30k (Young et al. 2014): It contains 31,783 images from the Flickr website, and each image is described by five different sentences. Following the settings in (Tu et al. 2022), this dataset is split into 29,783 training images, 1,000 validation images, and 1,000 testing images. |
| Dataset Splits | Yes | Flickr30k (Young et al. 2014): It contains 31,783 images from the Flickr website, and each image is described by five different sentences. Following the settings in (Tu et al. 2022), this dataset is split into 29,783 training images, 1,000 validation images, and 1,000 testing images. (See the split sketch after the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software such as 'Faster-RCNN', the 'Bottom-Up and Top-Down (BUTD) attention model', and the 'BERT model', but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The unified hash representation length is set to 16, 32, 64, and 128 bits respectively. For each image, the Faster-RCNN detector provided by the Bottom-Up and Top-Down (BUTD) attention model is used to extract R (R = 36) region proposals and to obtain a 2,048-dimensional feature for each region; the BUTD model is pre-trained on the ImageNet and Visual Genome datasets. For each input text, the base version of pre-trained BERT is used to obtain the original word embeddings with dimension 768. The weight α is 0.9. The model is trained with batch size 256. The momentum-encoder queue length K is 8,192 for Flickr30k and 65,536 for MSCOCO, the momentum-encoder update hyperparameter m is 0.99, and the temperature coefficient τ is 0.07. The highest MAP in the retrieval task is achieved at lr = 0.0005, so lr = 0.0005 is used in the experiments. Convergence of the cross-modal search algorithm stabilizes at approximately epoch 25. (See the configuration sketch after the table.) |
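
The dataset protocol in the Open Datasets and Dataset Splits rows can be expressed compactly. Below is a minimal Python sketch, assuming image-text pairs are held in ordinary lists; the function names and the fixed ordering assumed for Flickr30k are illustrative and not taken from the authors' code.

```python
import random
from typing import List, Sequence, Tuple

def split_mscoco(pairs: Sequence, num_query: int = 5000, seed: int = 0) -> Tuple[List, List]:
    """MSCOCO protocol: randomly hold out 5,000 image-text pairs as the query set;
    the remaining pairs form the retrieval (database) set."""
    rng = random.Random(seed)
    indices = list(range(len(pairs)))
    rng.shuffle(indices)
    query = [pairs[i] for i in indices[:num_query]]
    retrieval = [pairs[i] for i in indices[num_query:]]
    return query, retrieval

def split_flickr30k(pairs: Sequence,
                    n_train: int = 29783, n_val: int = 1000, n_test: int = 1000):
    """Flickr30k protocol following (Tu et al. 2022):
    29,783 training / 1,000 validation / 1,000 test images."""
    assert len(pairs) == n_train + n_val + n_test  # 31,783 images in total
    return (list(pairs[:n_train]),
            list(pairs[n_train:n_train + n_val]),
            list(pairs[n_train + n_val:]))
```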
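The Experiment Setup row lists hyperparameters, but no source code is released, so the following is a hedged sketch: it collects the reported values into a configuration object and shows a MoCo-style momentum-encoder update consistent with the K/m/τ hyperparameters. The names `CMGCHConfig` and `momentum_update` are hypothetical; the paper does not specify its implementation.

```python
from dataclasses import dataclass, field
from typing import List

import torch

@dataclass
class CMGCHConfig:
    """Reported hyperparameters (values quoted in the table); field names are illustrative."""
    hash_bits: List[int] = field(default_factory=lambda: [16, 32, 64, 128])
    num_regions: int = 36        # R region proposals per image (Faster-RCNN / BUTD)
    region_dim: int = 2048       # feature dimension per region
    text_dim: int = 768          # BERT-base word-embedding dimension
    alpha: float = 0.9           # weight α
    batch_size: int = 256
    queue_size: int = 8192       # K; set to 65,536 for MSCOCO
    momentum: float = 0.99       # m, momentum-encoder update rate
    temperature: float = 0.07    # τ, contrastive temperature
    lr: float = 5e-4             # best MAP reported at lr = 0.0005
    epochs: int = 25             # convergence stabilizes around epoch 25

@torch.no_grad()
def momentum_update(encoder_q: torch.nn.Module,
                    encoder_k: torch.nn.Module,
                    m: float = 0.99) -> None:
    """MoCo-style update of the key (momentum) encoder, as implied by the queue/momentum
    hyperparameters above; a sketch under that assumption, not the authors' implementation."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```

For MSCOCO, the same configuration would be instantiated with `queue_size=65536`; everything else in the quoted setup is shared across datasets.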