Position Focused Attention Network for Image-Text Matching
Authors: Yaxiong Wang, Hao Yang, Xueming Qian, Lin Ma, Jing Lu, Biao Li, Xin Fan
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the popular datasets Flickr30K and MS-COCO show the effectiveness of the proposed method. Besides the public datasets, we also conduct experiments on our collected practical large-scale news dataset (Tencent-News) to validate the practical application value of the proposed method. |
| Researcher Affiliation | Collaboration | Yaxiong Wang¹,²*, Hao Yang¹, Xueming Qian², Lin Ma³, Jing Lu¹, Biao Li¹ and Xin Fan¹. ¹Department of PCG, Tencent; ²School of Software Engineering, Xi'an Jiaotong University, China; ³Tencent AI Lab |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (e.g., a clearly labeled Algorithm section). |
| Open Source Code | Yes | The Tencent News data download link and our code can be found at: https://github.com/HaoYang0123/Position-Focused-Attention-Network/ |
| Open Datasets | Yes | We evaluate our PFAN on the widely used and authoritative datasets Flickr30K and MS-COCO; the data splits for these two datasets follow the work of [Karpathy et al., 2015] and [Lee et al., 2018]. ... The Tencent News data download link and our code can be found at: https://github.com/HaoYang0123/Position-Focused-Attention-Network/ |
| Dataset Splits | Yes | The data splits for these two datasets follow the work of [Karpathy et al., 2015] and [Lee et al., 2018]. ... we collect 143,317 training pairs and 1,000 pairs for validation; in total there are 141,736 different images and 130,230 different titles. |
| Hardware Specification | No | All of our experiments are conducted on a workstation with an NVIDIA Tesla GPU. This names the GPU family but not a specific model (e.g., Tesla V100) or any other hardware details. |
| Software Dependencies | No | The paper mentions using the "Faster R-CNN model [Ren et al., 2017]", "ResNet-101 [He et al., 2016]", and the "Adam optimization algorithm", but does not provide specific software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x). |
| Experiment Setup | Yes | The mini-batch size is 128. The image regions are extracted by the Faster R-CNN model [Ren et al., 2017], and we retain 36 detected regions for the image representation. Each image is split into 16 × 16 blocks (K = 16), and we set L as 15. The dimension of the joint embedding is fixed at 1024. The block index is first embedded into a 200-dimensional space... the original 2048-dimensional visual vector together with the 200-dimensional position feature is mapped into the 1024-dimensional space by a linear projection layer. ... the one-hot vector is first embedded into a 300-dimensional dense representation, then the dense representation is fed into the bi-GRU, whose hidden dimension is set to 1024 as well. ... we set the embedding size as 512 to get better performance. A minimal sketch of these encoders appears after the table. |
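
The setup row pins down most of the encoder dimensions, so the pipeline can be sketched directly. Below is a minimal PyTorch sketch under stated assumptions, not the authors' implementation (that lives in the linked repository): the class and variable names are invented for illustration, the mean over a region's L candidate blocks stands in for the paper's learned position attention, and the averaging of the two bi-GRU directions is a common convention in this model family rather than a detail quoted above.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the encoders described in the Experiment Setup row.
# Names and the block-pooling scheme are assumptions; see the authors' repo
# for the actual PFAN implementation.

class ImageEncoder(nn.Module):
    """36 Faster R-CNN region features + block-position embedding -> 1024-d."""
    def __init__(self, n_blocks=16 * 16, pos_dim=200,
                 vis_dim=2048, joint_dim=1024):
        super().__init__()
        # Each of the K*K = 256 image blocks gets a learned 200-d index embedding.
        self.block_embed = nn.Embedding(n_blocks, pos_dim)
        # 2048-d visual vector + 200-d position feature -> 1024-d joint space.
        self.proj = nn.Linear(vis_dim + pos_dim, joint_dim)

    def forward(self, regions, block_ids):
        # regions:   (batch, 36, 2048) region features from Faster R-CNN.
        # block_ids: (batch, 36, L) long tensor of the L = 15 candidate block
        #            indices per region; mean-pooled here as a stand-in for
        #            the paper's position attention over these blocks.
        pos = self.block_embed(block_ids).mean(dim=2)         # (batch, 36, 200)
        return self.proj(torch.cat([regions, pos], dim=-1))   # (batch, 36, 1024)

class TextEncoder(nn.Module):
    """One-hot word -> 300-d dense embedding -> bi-GRU with 1024-d hidden."""
    def __init__(self, vocab_size, word_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, hidden_dim,
                          batch_first=True, bidirectional=True)

    def forward(self, tokens):
        out, _ = self.gru(self.embed(tokens))     # (batch, len, 2 * 1024)
        # Average the forward and backward directions to get 1024-d word states.
        return out.view(*out.shape[:2], 2, -1).mean(dim=2)    # (batch, len, 1024)

# Toy usage with random inputs (batch of 2 images / captions).
img_enc = ImageEncoder()
txt_enc = TextEncoder(vocab_size=10000)
v = img_enc(torch.randn(2, 36, 2048), torch.randint(0, 256, (2, 36, 15)))
t = txt_enc(torch.randint(0, 10000, (2, 20)))
print(v.shape, t.shape)  # torch.Size([2, 36, 1024]) torch.Size([2, 20, 1024])
```

Both encoders land in the same 1024-d joint space, which is what allows the cross-modal attention and matching score described in the paper to compare region and word features directly.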