Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching
Authors: Huatian Zhang, Zhendong Mao, Kun Zhang, Yongdong Zhang
AAAI 2022, pp. 3262-3270 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show that our method achieves state-of-the-art performance on benchmarks Flickr30K and MSCOCO. |
| Researcher Affiliation | Academia | 1University of Science and Technology of China, Hefei, China 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China {huatianzhang, kkzhang}@mail.ustc.edu.cn, {zdmao, zhyd73}@ustc.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source codes will be released. https://github.com/CrossmodalGroup/CMCAN |
| Open Datasets | Yes | We evaluate our method on Flickr30K (Young et al. 2014) and MSCOCO (Lin et al. 2014) datasets. |
| Dataset Splits | Yes | Following dataset splits in (Lee et al. 2018), we use 29,000 images for training, 1,000 images for validation, and 1,000 images for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions software components like Faster R-CNN, Bi-GRU, and Adam optimizer, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set the word embedding dimension as 300. The dimension of vision-language shared embedding space D is set as 1024 and the dimension of distance-based similarity vectors P is 256. In region extended semantic representing, we extract K = 3 nearest detected regions in each of the top, bottom, left, and right scopes. ... The layer number L of the self-attentional mechanism for relevance measuring is 3. The Adam optimizer with 0.0002 as the initial learning rate is employed for model optimization. The learning rate is decayed by 10 times after 40 epochs in training on Flickr30K, and after 20 epochs in training on MSCOCO. The margin λ in triplet loss function is empirically set as 0.2. |
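The triplet loss with margin λ = 0.2 quoted in the setup row is the standard hinge-based bidirectional ranking objective common in image-text matching. The paper does not give its exact formulation here, so the following is a minimal NumPy sketch under the assumption of a sum-over-negatives variant: given an N×N image-caption similarity matrix whose diagonal holds the matched pairs, it penalizes any negative that comes within the margin of its positive, in both retrieval directions.

```python
import numpy as np

def triplet_ranking_loss(sim, margin=0.2):
    """Hinge-based bidirectional triplet ranking loss (sketch).

    sim: (N, N) similarity matrix where sim[i, j] is the similarity
         between image i and caption j; diagonal entries are the
         matched (positive) pairs. margin is the paper's lambda = 0.2.
    """
    n = sim.shape[0]
    pos = np.diag(sim)  # similarities of matched image-caption pairs

    # Image-to-caption direction: hinge on negative captions for each image.
    cost_cap = np.maximum(0.0, margin + sim - pos[:, None])
    # Caption-to-image direction: hinge on negative images for each caption.
    cost_img = np.maximum(0.0, margin + sim - pos[None, :])

    # Exclude the positive pairs themselves from the hinge terms.
    mask = np.eye(n, dtype=bool)
    cost_cap[mask] = 0.0
    cost_img[mask] = 0.0
    return cost_cap.sum() + cost_img.sum()
```

Well-separated pairs (every negative more than 0.2 below its positive) yield zero loss, while negatives inside the margin contribute linearly; many implementations instead keep only the hardest negative per query, which this sketch does not do.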