Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective

Authors: Yang Yang, Chubing Zhang, Yi-Chu Xu, Dianhai Yu, De-Chuan Zhan, Jian Yang

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed approach on three representative datasets. The results validate that the proposed semantic sharing can consistently boost the performance under NDCG metric.
Researcher Affiliation | Collaboration | Yang Yang¹, Chubing Zhang¹, Yi-Chu Xu², Dianhai Yu³, De-Chuan Zhan² and Jian Yang¹; ¹Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, Nanjing University of Science and Technology; ²Nanjing University; ³Baidu Inc
Pseudocode | Yes | Algorithm 1 The pseudo code
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | FLICKR25K [Huiskes and Lew, 2008], NUS-WIDE [Chua et al., 2009] and MSCOCO [Lin et al., 2014]
Dataset Splits | Yes | The dataset is split into 29,783 training images, 1,000 validation images and 1,000 testing images following [Karpathy and Fei-Fei, 2017].
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions using components such as Faster R-CNN, Bi-GRU, Word2Vec, and Transformer, but does not specify their versions or the underlying software frameworks (e.g., PyTorch, TensorFlow) used in the implementation.
Experiment Setup | Yes | Specifically, for the image modality, we utilize the pre-trained Faster R-CNN [Lee et al., 2018], which extracts visual regions with pooled ROI embeddings, i.e., the 1024-dimensional feature vector from the fc7 layer, denoted as $\{\hat{v}_i^t\}_{t=1}^{T_i}$ for the $i$-th instance, where $t$ is the index and $T_i$ is fixed to 36 for all image instances as in [Lee et al., 2018] for better performance. We randomly mask input segments with a probability of 15% as in [Li et al., 2019b] for the image and text modalities, and replace the masked ones $v_i^m$ and $w_j^m$ with the special token [MASK].
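
The masking step quoted in the Experiment Setup row is the main preprocessing detail a reader would need to re-implement. Below is a minimal PyTorch sketch of that step, assuming the region (or token) features for one instance are held in a (num_segments, dim) tensor and that the [MASK] token is represented by an embedding vector; the function name `mask_segments` and the placeholder zero-vector mask embedding are illustrative assumptions, not the authors' released code.

```python
import torch

MASK_PROB = 0.15  # masking probability reported in the paper

def mask_segments(features, mask_embedding, p=MASK_PROB):
    """Randomly replace segment embeddings with a [MASK] embedding.

    features:       (num_segments, dim) tensor, e.g. 36 x 1024 Faster R-CNN
                    ROI features for one image, or token embeddings for text.
    mask_embedding: (dim,) tensor standing in for the special [MASK] token.
    Returns the masked copy of `features` and the boolean mask that was applied.
    """
    mask = torch.rand(features.size(0)) < p   # select roughly 15% of segments
    masked = features.clone()
    masked[mask] = mask_embedding             # overwrite the selected rows
    return masked, mask

# Example: dummy image features with 36 regions of 1024 dimensions, as quoted above.
regions = torch.randn(36, 1024)
mask_vec = torch.zeros(1024)                  # placeholder [MASK] embedding
masked_regions, applied_mask = mask_segments(regions, mask_vec)
```

Since the quote states that both modalities are masked with the same 15% probability, the same routine would presumably also be applied to the text-side token embeddings before the masked positions are reconstructed, as in [Li et al., 2019b].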