Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching
Authors: Huatian Zhang, Zhendong Mao, Kun Zhang, Yongdong Zhang
AAAI 2022, pp. 3262-3270 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments show that our method achieves state-of-the-art performance on benchmarks Flickr30K and MSCOCO. |
| Researcher Affiliation | Academia | 1University of Science and Technology of China, Hefei, China 2Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China {huatianzhang, kkzhang}@mail.ustc.edu.cn, {zdmao, zhyd73}@ustc.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source codes will be released. https://github.com/CrossmodalGroup/CMCAN |
| Open Datasets | Yes | We evaluate our method on Flickr30K (Young et al. 2014) and MSCOCO (Lin et al. 2014) datasets. |
| Dataset Splits | Yes | Following dataset splits in (Lee et al. 2018), we use 29,000 images for training, 1,000 images for validation, and 1,000 images for testing. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions software components like Faster R-CNN, Bi-GRU, and Adam optimizer, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We set the word embedding dimension as 300. The dimension of vision-language shared embedding space D is set as 1024 and the dimension of distance-based similarity vectors P is 256. In region extended semantic representing, we extract K = 3 nearest detected regions in each of the top, bottom, left, and right scopes. ... The layer number L of the self-attentional mechanism for relevance measuring is 3. The Adam optimizer with 0.0002 as the initial learning rate is employed for model optimization. The learning rate is decayed by 10 times after 40 epochs in training on Flickr30K, and after 20 epochs in training on MSCOCO. The margin λ in triplet loss function is empirically set as 0.2. |
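The triplet loss with margin λ = 0.2 quoted in the setup row is the standard hinge-based bidirectional ranking objective common in image-text matching. The paper does not give its exact formulation here, so the following is a minimal NumPy sketch under the assumption of a sum-over-negatives variant: given an N×N image-caption similarity matrix whose diagonal holds the matched pairs, it penalizes any negative that comes within the margin of its positive, in both retrieval directions.

```python
import numpy as np

def triplet_ranking_loss(sim, margin=0.2):
    """Hinge-based bidirectional triplet ranking loss (sketch).

    sim: (N, N) similarity matrix where sim[i, j] is the similarity
         between image i and caption j; diagonal entries are the
         matched (positive) pairs. margin is the paper's lambda = 0.2.
    """
    n = sim.shape[0]
    pos = np.diag(sim)  # similarities of matched image-caption pairs

    # Image-to-caption direction: hinge on negative captions for each image.
    cost_cap = np.maximum(0.0, margin + sim - pos[:, None])
    # Caption-to-image direction: hinge on negative images for each caption.
    cost_img = np.maximum(0.0, margin + sim - pos[None, :])

    # Exclude the positive pairs themselves from the hinge terms.
    mask = np.eye(n, dtype=bool)
    cost_cap[mask] = 0.0
    cost_img[mask] = 0.0
    return cost_cap.sum() + cost_img.sum()
```

Well-separated pairs (every negative more than 0.2 below its positive) yield zero loss, while negatives inside the margin contribute linearly; many implementations instead keep only the hardest negative per query, which this sketch does not do.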