Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval

Authors: Kaiyi Lin, Xing Xu, Lianli Gao, Zheng Wang, Heng Tao Shen (pp. 11515-11522)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our model using four benchmark datasets on image-text retrieval tasks and one large-scale dataset on image-sketch retrieval tasks. The experimental results show that our method establishes the new state-of-the-art performance for both tasks on all datasets.
Researcher Affiliation | Academia | Kaiyi Lin, Xing Xu, Lianli Gao, Zheng Wang, Heng Tao Shen (Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, China)
Pseudocode | Yes | Algorithm 1: Training procedure of the proposed LCALE.
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing the code for the work described in this paper, nor does it provide a direct link to a source-code repository.
Open Datasets | Yes | To verify the effectiveness of our proposed method, we conduct experiments under two cross-modal retrieval scenarios: image-text retrieval and image-sketch retrieval. The image-text retrieval is evaluated on four widely-used cross-modal datasets, named Wikipedia (Rasiwasia et al. 2010), Pascal Sentences (Rashtchian, Young, and Hockenmaier 2010), NUS-WIDE (Chua et al. 2009) and PKU-XMediaNet (Huang, Peng, and Yuan 2018). ... For the image-sketch retrieval task, we follow the dataset split and feature extraction settings in (Dutta and Akata 2019) to perform experiments on the Sketchy dataset.
Dataset Splits | Yes | Table 1: The general statistics of all datasets. Here */* denotes the number of seen/unseen Classes, and the number of image/text (or sketch) samples in Train and Test, respectively.
Datasets | Classes | Train | Test
Wikipedia | 5/5 | 2,173/2,173 | 693/693
Pascal Sentences | 10/10 | 800/800 | 200/200
NUS-WIDE | 5/5 | 42,941/42,941 | 28,661/28,661
PKU-XMediaNet | 100/100 | 32,000/32,000 | 8,000/8,000
Sketchy | 100/25 | 58,376/61,060 | 14,626/14,421
Hardware Specification | No | The paper mentions that the method is 'implemented using the popular PyTorch toolkit' but does not provide any specific details regarding the hardware used for running the experiments, such as GPU/CPU models, memory, or cloud instances.
Software Dependencies | No | The paper states, 'We implement our LCALE method using the popular PyTorch toolkit,' but it does not specify the version number for PyTorch or any other software dependencies required to replicate the experiment.
Experiment Setup | Yes | Details of Network. We implement our LCALE method using the popular PyTorch toolkit. For our network architecture, all encoders contain three fully connected layers with dimensions [4096, 2048, 64], activated by the ReLU activation function. Similarly, all decoders contain three fully connected layers with dimensions [4096, 2048, K_*], with each layer activated by ReLU, where * = v, t, c and K_* represents the dimension of the original image, text, and class features, respectively. In addition, we build the regressors of both image and text modalities with three fully connected layers of [4096, 4096, 300] for class-embedding reconstruction, with each layer followed by a ReLU layer. The hyper-parameters α, β, λ and γ are set to 1, 0.1, 0.1 and 0.01, respectively, and the latent embedding size is set to 64. The learning rate μ is initially set to 0.0001 with weight decay every 10 epochs.
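The quoted setup translates directly into a small PyTorch module. Below is a minimal sketch of that layout, not the authors' released code: the input feature dimensions (IMG_DIM, TXT_DIM, CLS_DIM), the regressor inputs, the choice of Adam, and the learning-rate decay factor are assumptions made for illustration; the layer widths, the 64-dimensional latent space, and the hyper-parameter values come from the text above.

```python
# Minimal PyTorch sketch of the network layout quoted in the experiment setup.
# NOTE: this is an illustration, not the authors' implementation. Input feature
# sizes, regressor inputs, the optimizer choice, and the LR decay factor are
# assumptions; layer widths, the 64-d latent space, and the hyper-parameter
# values are taken from the quoted setup.
import torch
import torch.nn as nn

LATENT_DIM = 64
IMG_DIM, TXT_DIM, CLS_DIM = 4096, 300, 300  # hypothetical input feature dimensions

def mlp(dims):
    """Stack of fully connected layers, each followed by a ReLU activation."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class LCALESketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoders: three FC layers with output sizes [4096, 2048, 64].
        self.enc_img = mlp([IMG_DIM, 4096, 2048, LATENT_DIM])
        self.enc_txt = mlp([TXT_DIM, 4096, 2048, LATENT_DIM])
        # Decoders: three FC layers with output sizes [4096, 2048, K_*] mapping the
        # latent code back to the original features (* = v, t shown here; the class
        # branch * = c would be built analogously with output size CLS_DIM).
        self.dec_img = mlp([LATENT_DIM, 4096, 2048, IMG_DIM])
        self.dec_txt = mlp([LATENT_DIM, 4096, 2048, TXT_DIM])
        # Regressors: three FC layers [4096, 4096, 300] for class-embedding
        # reconstruction; the input is assumed here to be the original features.
        self.reg_img = mlp([IMG_DIM, 4096, 4096, 300])
        self.reg_txt = mlp([TXT_DIM, 4096, 4096, 300])

    def forward(self, img_feat, txt_feat):
        z_img, z_txt = self.enc_img(img_feat), self.enc_txt(txt_feat)
        return {
            "z_img": z_img,                          # 64-d latent image embedding
            "z_txt": z_txt,                          # 64-d latent text embedding
            "img_rec": self.dec_img(z_img),          # reconstructed image features
            "txt_rec": self.dec_txt(z_txt),          # reconstructed text features
            "cls_from_img": self.reg_img(img_feat),  # 300-d class-embedding regression
            "cls_from_txt": self.reg_txt(txt_feat),
        }

# Quoted hyper-parameters: alpha=1, beta=0.1, lambda=0.1, gamma=0.01 weight the
# paper's loss terms (not reproduced here); learning rate mu = 1e-4, decayed
# every 10 epochs (the decay factor below is an assumption).
model = LCALESketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
```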