Mining on Heterogeneous Manifolds for Zero-Shot Cross-Modal Image Retrieval

Authors: Fan Yang, Zheng Wang, Jing Xiao, Shin'ichi Satoh

AAAI 2020, pp. 12589-12596

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "we validate our method on visible vs. thermal datasets and achieve significant performance improvement."
Researcher Affiliation | Academia | "¹The University of Tokyo, Japan; ²National Institute of Informatics, Japan"
Pseudocode | No | The paper describes its methods through text and mathematical equations, but it does not include a formal pseudocode or algorithm block.
Open Source Code | Yes | "The code of this paper: https://github.com/fyang93/cross-modal-retrieval"
Open Datasets | Yes | "MNIST dataset (LeCun et al. 1998) (...) SVHN (Street View House Numbers) dataset (Netzer et al. 2011) (...) RegDB (Nguyen et al. 2017) (...) SYSU-MM01 (Wu et al. 2017)"
Dataset Splits | Yes | "RegDB (...) the entire dataset was divided into a training set and a testing set. (...) SYSU-MM01 (...) The training set contains 22,258 visible images and 11,909 thermal images of 395 persons. The testing set contains 3,803 thermal query images where 96 persons appeared, and 301 visible images randomly sampled for each person as the gallery set. (...) In the MNIST dataset, the numbers of images in the query and gallery set are 3,011 and 18,065 respectively. The SVHN dataset has 15,299 images in the gallery and 5,274 query images. (...) we consistently set β = 0.2 and the margin of triplet loss to 0.6 through validation."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for experiments.
Software Dependencies | No | The paper mentions using ResNet18 and ResNet50 as backbones but does not provide specific version numbers for software libraries, frameworks, or programming languages used (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup | Yes | "ResNet18 serves as the backbone CNN for MNIST and SVHN datasets, while ResNet50 is used for RegDB and SYSU-MM01 datasets (...) The dimension of the output of FC1 is set to 512 for all datasets. (...) The overall loss function for the cross-modal model is L = L_class + β · L_tri (Eq. 9), where β is a weight on the triplet loss. In our experiments, we consistently set β = 0.2 and the margin of triplet loss to 0.6 through validation."
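
The quoted experiment setup can be summarized in a short sketch. The following is a minimal PyTorch sketch, not the authors' implementation (the official code is in the linked repository); it assumes a standard cross-entropy classification loss for L_class, and the names CrossModalHead and overall_loss are illustrative only.

```python
# Minimal sketch of the reported setup: ResNet backbone + 512-d FC1 embedding,
# and the combined loss L = L_class + beta * L_tri (Eq. 9) with beta = 0.2 and
# triplet margin 0.6. Hypothetical class/function names; not the official code.
import torch.nn as nn
from torchvision import models


class CrossModalHead(nn.Module):
    """ResNet backbone followed by FC1 (output dim 512) and a classifier."""

    def __init__(self, num_classes, backbone="resnet50", embed_dim=512):
        super().__init__()
        resnet = getattr(models, backbone)()  # resnet18 for MNIST/SVHN, resnet50 for RegDB/SYSU-MM01
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the original FC layer
        self.fc1 = nn.Linear(resnet.fc.in_features, embed_dim)        # FC1, output dimension 512
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feat = self.backbone(x).flatten(1)
        embed = self.fc1(feat)
        logits = self.classifier(embed)
        return embed, logits


# L = L_class + beta * L_tri, with beta = 0.2 and triplet margin 0.6 (set through validation).
class_criterion = nn.CrossEntropyLoss()          # assumed form of the classification loss
triplet_criterion = nn.TripletMarginLoss(margin=0.6)
beta = 0.2


def overall_loss(logits, labels, anchor, positive, negative):
    l_class = class_criterion(logits, labels)
    l_tri = triplet_criterion(anchor, positive, negative)
    return l_class + beta * l_tri
```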