Position Focused Attention Network for Image-Text Matching

Authors: Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the popular datasets Flickr30K and MS-COCO show the effectiveness of the proposed method. Besides the public datasets, we also conduct experiments on our collected practical large-scale news dataset (Tencent-News) to validate the practical application value of proposed method.
Researcher Affiliation | Collaboration | Yaxiong Wang(1,2)*, Hao Yang(1), Xueming Qian(2), Lin Ma(3), Jing Lu(1), Biao Li(1) and Xin Fan(1). 1 Department of PCG, Tencent; 2 School of Software Engineering, Xi'an Jiaotong University, China; 3 Tencent AI Lab
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., a clearly labeled Algorithm section).
Open Source Code | Yes | The Tencent News data download link and our code can be found at: https://github.com/HaoYang0123/Position-Focused-Attention-Network/
Open Datasets | Yes | We evaluate our PFAN on the widely used and authoritative dataset Flickr30K, MS-COCO, the data splits for these two datasets follow the work [Karpathy et al., 2015] and [Lee et al., 2018]. ... The Tencent News data download link and our code can be found at: https://github.com/HaoYang0123/Position-Focused-Attention-Network/
Dataset Splits | Yes | the data splits for these two datasets follow the work [Karpathy et al., 2015] and [Lee et al., 2018]. ... we collect 143,317 training pairs, and 1,000 pairs for validating and there are 141,736 different images, 130,230 different titles in total.
Hardware Specification | No | "All of our experiments are conducted on a workstation with NVIDIA Tesla GPU." This names the GPU family but not a specific model (e.g., Tesla V100) or any other hardware details.
Software Dependencies | No | The paper mentions using the "Faster R-CNN model [Ren et al., 2017]", "ResNet-101 [He et al., 2016]", and the "Adam optimization algorithm", but does not provide specific software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x).
Experiment Setup | Yes | The mini-batch size is 128. The image region is extracted by the Faster R-CNN model [Ren et al., 2017], and we retain 36 detected regions for the image representation. Each image is split into 16 × 16 blocks (K = 16), and we set L as 15. The dimension of joint embedding is fixed as 1024. The block index is first embedded into 200-dimensional space... the original 2048-dimensional visual vector together with 200-dimensional position feature is mapped into the 1024-dimensional space by a linear projection layer. ... the one-hot vector is first embedded into 300-dimensional dense representation, then the dense representation is fed into the bi-GRU whose hidden dimension is set as 1024 as well. ... we set the embedding size as 512 to get better performance.
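
For concreteness, the reported setup can be sketched in code. This is a minimal sketch, assuming PyTorch as the framework (the paper does not name one); the class names (RegionEncoder, TextEncoder) and all variable names are hypothetical, and the code only instantiates the stated hyperparameters: 36 regions with 2048-d features, a 16 × 16 block grid with a 200-d block-index embedding, a 1024-d joint space, and 300-d word embeddings feeding a bi-GRU with 1024 hidden units. Note that PFAN's position feature actually attends over L = 15 candidate blocks per region; for brevity this sketch embeds a single block index per region.

```python
import torch
import torch.nn as nn

# Hypothetical constants mirroring the reported setup.
N_REGIONS = 36     # Faster R-CNN regions retained per image
K = 16             # image split into K x K blocks
POS_DIM = 200      # block-index embedding size
VIS_DIM = 2048     # visual feature size per region
JOINT_DIM = 1024   # joint embedding size
WORD_DIM = 300     # dense word embedding size


class RegionEncoder(nn.Module):
    """Fuse each 2048-d region feature with a 200-d block-position
    embedding and project the result into the 1024-d joint space."""

    def __init__(self):
        super().__init__()
        self.block_embed = nn.Embedding(K * K, POS_DIM)       # block index -> 200-d
        self.proj = nn.Linear(VIS_DIM + POS_DIM, JOINT_DIM)   # linear projection layer

    def forward(self, region_feats, block_ids):
        # region_feats: (batch, 36, 2048); block_ids: (batch, 36) in [0, K*K)
        pos = self.block_embed(block_ids)                     # (batch, 36, 200)
        return self.proj(torch.cat([region_feats, pos], dim=-1))  # (batch, 36, 1024)


class TextEncoder(nn.Module):
    """One-hot words -> 300-d dense embedding -> bi-GRU with 1024 hidden units."""

    def __init__(self, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, WORD_DIM)
        self.gru = nn.GRU(WORD_DIM, JOINT_DIM, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        out, _ = self.gru(self.embed(token_ids))              # (batch, T, 2 * 1024)
        # Average forward/backward directions to stay 1024-d (one common choice;
        # the paper does not state how the two directions are combined).
        fwd, bwd = out.chunk(2, dim=-1)
        return (fwd + bwd) / 2                                # (batch, T, 1024)
```

The bidirectional GRU emits 2 × 1024 values per token, so some reduction is needed to keep the text side in the 1024-d joint space; averaging the two directions is one common convention, but the paper does not state which one it uses.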