Similarity Reasoning and Filtration for Image-Text Matching

Authors: Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu (pp. 1218-1226)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets, and the good interpretability of the SGR and SAF modules with extensive qualitative experiments and analyses."
Researcher Affiliation | Collaboration | Haiwen Diao (Dalian University of Technology, Dalian, China), Ying Zhang (Tencent AI Lab, Shenzhen, China), Lin Ma (Meituan, Beijing, China), Huchuan Lu (Dalian University of Technology, Dalian, China)
Pseudocode | No | Not found. The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Our implementation of this paper is publicly available on GitHub at: https://github.com/Paranioar/SGRAF."
Open Datasets | Yes | "We evaluate our model on the MSCOCO (Lin et al. 2014) and Flickr30K (Young et al. 2014) datasets."
Dataset Splits | Yes | "The MSCOCO dataset contains 123,287 images, and each image is annotated with 5 captions. The dataset is split into 113,287 images for training, 5,000 images for validation and 5,000 images for testing. ... The Flickr30K dataset contains 31,783 images with 5 corresponding captions each. Following the split in (Frome et al. 2013), we use 1,000 images for validation, 1,000 images for testing and the rest for training."
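The split sizes quoted above are internally consistent, which can be checked with simple arithmetic. A minimal sketch (the variable names are illustrative, not from the authors' code):

```python
# Dataset split sizes as reported for MSCOCO and Flickr30K.
MSCOCO_TOTAL = 123287
mscoco = {"train": 113287, "val": 5000, "test": 5000}
assert sum(mscoco.values()) == MSCOCO_TOTAL  # splits cover the full dataset

FLICKR30K_TOTAL = 31783
flickr = {"val": 1000, "test": 1000}
# "the rest for training" implies:
flickr["train"] = FLICKR30K_TOTAL - flickr["val"] - flickr["test"]
print(flickr["train"])  # 29783 training images
```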
Hardware Specification | No | Not found. The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | Not found. The paper mentions software tools and optimizers such as Faster-RCNN, ResNet-101, and the Adam optimizer, but does not specify their version numbers or any other software dependencies with version information.
Experiment Setup | Yes | "For each image, we take the Faster-RCNN (Ren et al. 2015) detector with ResNet-101 provided by (Anderson et al. 2018) to extract the top K = 36 region proposals and obtain a 2048-dimensional feature for each region. For each sentence, we set the word embedding size as 300, and the number of hidden states as 1024. The dimension of similarity representation m is 256, with smooth temperature λ = 9, reasoning steps N = 3, and margin γ = 0.2. Our model employs the Adam optimizer (Kingma and Ba 2015) to train the SGRAF network with the minibatch size of 128. The learning rate is set to 0.0002 for the first 10 epochs and 0.00002 for the next 10 epochs on MSCOCO. For Flickr30K, we start training the SGR (SAF) module with learning rate 0.0002 for 30 (20) epochs and decay it by 0.1 for the next 10 epochs."
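The reported hyperparameters and learning-rate schedule can be collected into a config sketch. This is a minimal illustration under the assumption that the schedule is piecewise constant as the quote describes; the dictionary keys and function are illustrative and not taken from the authors' repository:

```python
# Hyperparameters quoted in the paper (key names are illustrative).
config = {
    "num_regions": 36,       # top-K Faster-RCNN region proposals per image
    "region_dim": 2048,      # feature dimension per region
    "word_embed_dim": 300,   # word embedding size
    "hidden_dim": 1024,      # number of hidden states
    "sim_dim": 256,          # similarity representation dimension m
    "smooth_lambda": 9,      # smooth temperature
    "reasoning_steps": 3,    # N, reasoning steps of the SGR module
    "margin": 0.2,           # ranking margin gamma
    "batch_size": 128,
}

def learning_rate(epoch: int, dataset: str = "mscoco", module: str = "SGR") -> float:
    """Piecewise-constant LR schedule as described in the paper:
    MSCOCO: 2e-4 for epochs 0-9, then 2e-5 for epochs 10-19.
    Flickr30K: 2e-4 for 30 (SGR) or 20 (SAF) epochs, then decayed by 0.1.
    """
    if dataset == "mscoco":
        return 2e-4 if epoch < 10 else 2e-5
    warmup_epochs = 30 if module == "SGR" else 20
    return 2e-4 if epoch < warmup_epochs else 2e-5
```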