Step-Wise Hierarchical Alignment Network for Image-Text Matching
Authors: Zhong Ji, Kexin Chen, Haoran Wang
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on two benchmark datasets demonstrate the superiority of our proposed method. We conduct experiments on two public datasets, i.e., Flickr30k and MS-COCO, and the quantitative experimental results validate that our model can achieve state-of-the-art performance on both datasets. |
| Researcher Affiliation | Academia | School of Electrical and Information Engineering, Tianjin University, Tianjin, China {jizhong, kxchen, haoranwang}@tju.edu.cn |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper does not provide explicit statements or links indicating the release of source code for the described methodology. |
| Open Datasets | Yes | Two benchmark datasets are used in our experiments to evaluate the performance of our method: (1) Flickr30k contains 31783 images, each annotated with 5 sentences. Following [Karpathy and Fei-Fei, 2015], we split the dataset into 1000 test images, 1000 validation images and 29000 training images. (2) MS-COCO is another large-scale image captioning dataset with 123287 images, each paired with 5 descriptions. We follow [Lee et al., 2018] to split the dataset into 5000 images for validation, 5000 images for testing and the remaining 113287 images for training. |
| Dataset Splits | Yes | Following [Karpathy and Fei-Fei, 2015], we split the dataset into 1000 test images, 1000 validation images and 29000 training images. We follow [Lee et al., 2018] to split the dataset into 5000 images for validation, 5000 images for testing and the remaining 113287 images for training. (A minimal split sketch follows the table.) |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for experiments were mentioned in the paper. |
| Software Dependencies | No | The paper mentions software like the 'Adam optimizer' and 'GloVe' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We train our model with the Adam optimizer for 30 epochs on each dataset. The dimension of the joint embedding space for image regions and textual words is set to 1024, the dimension of word embeddings is set to 300, and the other parameters are set empirically as follows: µ1 = 0.3, µ2 = 0.5, λ = 15 and m = 0.2. (A hedged training-setup sketch follows the table.) |
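
For reference, a minimal Python sketch of the Karpathy-style splits quoted above. The function name and the contiguous-block ordering are illustrative assumptions; the cited works define the splits via fixed image lists, not by index position.

```python
# Illustrative only: reproduces the quoted split sizes, not the official index lists.
FLICKR30K_SPLITS = {"train": 29000, "val": 1000, "test": 1000}  # of 31783 images
MSCOCO_SPLITS = {"train": 113287, "val": 5000, "test": 5000}    # 123287 images total

def split_indices(num_images, splits):
    """Partition image indices into contiguous train/val/test blocks (assumed order)."""
    assert sum(splits.values()) <= num_images, "split sizes exceed dataset size"
    out, start = {}, 0
    for name, size in splits.items():
        out[name] = range(start, start + size)
        start += size
    return out

coco = split_indices(123287, MSCOCO_SPLITS)
print({name: len(idx) for name, idx in coco.items()})
# -> {'train': 113287, 'val': 5000, 'test': 5000}
```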
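
Below is a hedged sketch of the quoted training setup. The bidirectional hinge triplet loss with margin m = 0.2 is the standard objective in this line of work (e.g., VSE++/SCAN style); the paper's exact loss and the roles of µ1, µ2 and λ are not quoted here, so everything beyond the stated hyperparameter values (including the learning rate and the placeholder encoder) is an assumption.

```python
# Hedged sketch, not the authors' code: Adam optimizer, 30 epochs, 1024-d joint
# embedding, 300-d word embeddings, margin m = 0.2 as quoted from the paper.
import torch
import torch.nn as nn

EMBED_DIM, WORD_DIM, EPOCHS, MARGIN = 1024, 300, 30, 0.2

def triplet_ranking_loss(sim, margin=MARGIN):
    """Bidirectional hinge loss over an image-text similarity matrix `sim`
    (B x B), where sim[i, i] are the matching pairs. Assumed loss form."""
    pos = sim.diag().view(-1, 1)
    cost_s = (margin + sim - pos).clamp(min=0)       # rank captions per image
    cost_im = (margin + sim - pos.t()).clamp(min=0)  # rank images per caption
    mask = torch.eye(sim.size(0), dtype=torch.bool)  # ignore the positive pairs
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    return cost_s.sum() + cost_im.sum()

model = nn.Sequential(nn.Linear(WORD_DIM, EMBED_DIM))       # placeholder encoder
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # lr is an assumption

sim = torch.randn(32, 32)            # toy similarity matrix for a batch of 32 pairs
loss = triplet_ranking_loss(sim)
```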