Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective
Authors: Yang Yang, Chubing Zhang, Yi-Chu Xu, Dianhai Yu, De-Chuan Zhan, Jian Yang
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed approach on three representative datasets. The results validate that the proposed semantic sharing can consistently boost the performance under the NDCG metric. (A reference NDCG sketch appears after the table.) |
| Researcher Affiliation | Collaboration | Yang Yang¹, Chubing Zhang¹, Yi-Chu Xu², Dianhai Yu³, De-Chuan Zhan² and Jian Yang¹ (¹Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, Nanjing University of Science and Technology; ²Nanjing University; ³Baidu Inc) |
| Pseudocode | Yes | Algorithm 1: The pseudo code |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository. |
| Open Datasets | Yes | FLICKR25K [Huiskes and Lew, 2008], NUS-WIDE [Chua et al., 2009] and MSCOCO [Lin et al., 2014] |
| Dataset Splits | Yes | The dataset is split into 29,783 training images, 1,000 validation images and 1,000 testing images following [Karpathy and Fei-Fei, 2017]. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using components like Faster R-CNN, Bi-GRU, Word2Vec, and Transformer, but does not specify their versions or the underlying software frameworks (e.g., PyTorch, TensorFlow) used in the implementation. |
| Experiment Setup | Yes | Specifically, for the image modality, we utilize the pre-trained Faster R-CNN [Lee et al., 2018], which extracts visual regions with pooled ROI embeddings, i.e., the 1024-dimensional feature vector from the fc7 layer, denoted as $\{\hat{v}_i^t\}_{t=1}^{T_i}$ for the $i$-th instance, where $t$ is the index and $T_i$ is fixed as 36 for all image instances as in [Lee et al., 2018] for better performance. We randomly mask input segments with a probability of 15% as in [Li et al., 2019b] for the image and text modalities, and replace the masked ones $v_i^m$ and $w_j^m$ with the special token [MASK]. |
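The masking step in the Experiment Setup row lends itself to a short illustration. The sketch below is not the authors' released code: the helper `mask_segments`, the segment representation, and the seeding are assumptions. It simply shows each input segment being masked independently with probability 15% and replaced by a [MASK] token, applied identically to text tokens and, by analogy, to image regions.

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # masking probability reported in the paper

def mask_segments(segments, mask_prob=MASK_PROB, mask_token=MASK_TOKEN, seed=None):
    """Randomly replace input segments (text tokens or image-region IDs)
    with a special [MASK] token, each independently with probability mask_prob."""
    rng = random.Random(seed)
    masked, positions = [], []
    for idx, seg in enumerate(segments):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            positions.append(idx)  # record masked positions for the reconstruction objective
        else:
            masked.append(seg)
    return masked, positions

# Example: masking a tokenized caption.
tokens = ["a", "dog", "runs", "on", "the", "beach"]
masked_tokens, masked_positions = mask_segments(tokens, seed=0)
print(masked_tokens, masked_positions)
```

For image regions, implementations typically replace the masked region's 1024-dimensional feature vector with zeros or a learned [MASK] embedding rather than a string token; the string form here is purely illustrative.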
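The Research Type row reports gains under the NDCG metric. For reference, here is a minimal sketch of the standard NDCG@k computation, assuming linear gains and a log2 rank discount; the paper may use a variant (e.g., exponential gains 2^rel − 1), so treat this as illustrative rather than the paper's exact formulation.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items, in ranked order."""
    # rank is 0-based, so the discount for position rank is log2(rank + 2)
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: graded relevance scores of retrieved items, in ranked order.
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6), 4))
```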