Rethinking Label-Wise Cross-Modal Retrieval from A Semantic Sharing Perspective

Authors: Yang Yang, Chubing Zhang, Yi-Chu Xu, Dianhai Yu, De-Chuan Zhan, Jian Yang

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed approach on three representative datasets. The results validate that the proposed semantic sharing can consistently boost the performance under NDCG metric.
Researcher Affiliation | Collaboration | Yang Yang¹, Chubing Zhang¹, Yi-Chu Xu², Dianhai Yu³, De-Chuan Zhan² and Jian Yang¹; ¹Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, Nanjing University of Science and Technology; ²Nanjing University; ³Baidu Inc
Pseudocode | Yes | Algorithm 1 The pseudo code
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | FLICKR25K [Huiskes and Lew, 2008], NUS-WIDE [Chua et al., 2009] and MSCOCO [Lin et al., 2014]
Dataset Splits | Yes | The dataset is split into 29,783 training images, 1,000 validation images and 1,000 testing images following [Karpathy and Fei-Fei, 2017].
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions using components such as Faster R-CNN, Bi-GRU, Word2Vec, and Transformer, but does not specify their versions or the underlying software frameworks (e.g., PyTorch, TensorFlow) used in the implementation.
Experiment Setup | Yes | Specifically, for the image modality, we utilize the pre-trained Faster R-CNN [Lee et al., 2018], which extracts visual regions with pooled ROI embeddings, i.e., the 1024-dimensional feature vector from the fc7 layer, denoted as $\{\hat{v}_i^t\}_{t=1}^{T_i}$ for the $i$-th instance, where $t$ is the index and $T_i$ is fixed to 36 for all image instances as in [Lee et al., 2018] for better performance. We randomly mask input segments with a probability of 15% as in [Li et al., 2019b] for the image and text modalities, and replace the masked ones $v_i^m$ and $w_j^m$ with the special token [MASK].
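
The masking step quoted in the Experiment Setup row is the main preprocessing detail a reader would need to re-implement. Below is a minimal PyTorch sketch of that step, assuming the region (or token) features for one instance are held in a (num_segments, dim) tensor and that the [MASK] token is represented by an embedding vector; the function name `mask_segments` and the placeholder zero-vector mask embedding are illustrative assumptions, not the authors' released code.

```python
import torch

MASK_PROB = 0.15  # masking probability reported in the paper

def mask_segments(features, mask_embedding, p=MASK_PROB):
    """Randomly replace segment embeddings with a [MASK] embedding.

    features:       (num_segments, dim) tensor, e.g. 36 x 1024 Faster R-CNN
                    ROI features for one image, or token embeddings for text.
    mask_embedding: (dim,) tensor standing in for the special [MASK] token.
    Returns the masked copy of `features` and the boolean mask that was applied.
    """
    mask = torch.rand(features.size(0)) < p   # select roughly 15% of segments
    masked = features.clone()
    masked[mask] = mask_embedding             # overwrite the selected rows
    return masked, mask

# Example: dummy image features with 36 regions of 1024 dimensions, as quoted above.
regions = torch.randn(36, 1024)
mask_vec = torch.zeros(1024)                  # placeholder [MASK] embedding
masked_regions, applied_mask = mask_segments(regions, mask_vec)
```

Since the quote states that both modalities are masked with the same 15% probability, the same routine would presumably also be applied to the text-side token embeddings before the masked positions are reconstructed, as in [Li et al., 2019b].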