Ranking-Based Deep Cross-Modal Hashing

Authors: Xuanwu Liu, Guoxian Yu, Carlotta Domeniconi, Jun Wang, Yazhou Ren, Maozu Guo (pp. 4400-4407)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on real multi-modal datasets show that RDCMH outperforms other competitive baselines and achieves state-of-the-art performance in cross-modal retrieval applications.
Researcher Affiliation | Academia | 1) College of Computer and Information Sciences, Southwest University, Chongqing, China; 2) Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan, China; 3) Department of Computer Science, George Mason University, Fairfax, USA; 4) SMILE Lab, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; 5) School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
Pseudocode | Yes | Algorithm 1 RDCMH: Ranking-based Deep Cross-Modal Hashing
Open Source Code | Yes | The code of RDCMH is available at mlda.swu.edu.cn/codes.php?name=RDCMH.
Open Datasets | Yes | We use three benchmark datasets: NUS-WIDE, Wiki, and MIRFlickr, to evaluate the performance of RDCMH. NUS-WIDE [1] contains 260,648 web images, some of which are associated with textual tags. It is a multi-label dataset in which each point is annotated with one or more of 81 concept labels. The text for each point is represented as a 1000-dimensional bag-of-words vector, and the hand-crafted feature for each image is a 500-dimensional bag-of-visual-words (BOVW) vector. Wiki [2] is generated from a group of 2,866 Wikipedia documents. Each document is an image-text pair labeled with one of 10 semantic classes; the images are represented by 128-dimensional SIFT feature vectors, and the text articles are represented as probability distributions over 10 topics derived from a Latent Dirichlet Allocation (LDA) model. MIRFlickr [3] originally contains 25,000 instances collected from Flickr. Each instance consists of an image and its associated textual tags, and is manually annotated with one or more labels from a total of 24 semantic labels. The text for each point is represented as a 1386-dimensional bag-of-words vector, and for the hand-crafted-feature-based method each image is represented by a 512-dimensional GIST feature vector. (A bag-of-words feature sketch follows the table.) [1] http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm [2] https://www.wikidata.org/wiki/Wikidata [3] http://press.liacs.nl/mirflickr/mirdownload.html
Dataset Splits | No | The paper mentions a 'training set', a 'mini-batch size for gradient descent' of 128, and a 'semi-supervised semantic ranking list used for training', but it does not specify explicit training/validation/test splits or a validation set.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions 'CNN', 'AlexNet', and a 'bag-of-words (BOW) representation' for feature learning, but does not provide specific version numbers for any software, libraries, or frameworks used.
Experiment Setup | Yes | For RDCMH, we set the mini-batch size for gradient descent to 128 and the dropout rate to 0.5 on the fully-connected layers to avoid overfitting. The regularization parameter λ in Eq. (4) is set to 1, and the number of iterations for optimizing Eq. (4) is fixed to 500. The deep neural network adopted for the image modality is a CNN with eight layers: the first six layers are the same as those in CNN-F (Chatfield et al. 2014), and the seventh and eighth layers are fully-connected layers whose outputs are the learned image features. For the text modality, each text is first represented as a bag-of-words (BOW) vector. These bag-of-words vectors are then used as inputs to a neural network with two fully-connected layers, denoted full1 and full2. The full1 layer has 4096 neurons, and the second layer full2 has c (the hash-code length) neurons. The activation function for the first layer is ReLU, and that for the second layer is the identity function. (A sketch of the text network and these hyperparameters follows the table.)
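Below is a minimal Python sketch, not taken from the paper or its released code, of how a fixed-size bag-of-words text representation like the ones described in the Open Datasets row (1000-dimensional for NUS-WIDE, 1386-dimensional for MIRFlickr) could be built with scikit-learn's CountVectorizer. The tag strings and the helper name build_bow_features are illustrative placeholders.

# Sketch only: fixed-vocabulary bag-of-words features from tag documents.
# vocab_size mirrors the per-dataset dimensions quoted above (an assumption
# about how the BOW vectors were produced, not the authors' pipeline).
from sklearn.feature_extraction.text import CountVectorizer

def build_bow_features(tag_documents, vocab_size):
    """Turn a list of space-separated tag strings into a dense
    (n_samples, d) bag-of-words count matrix with d <= vocab_size."""
    vectorizer = CountVectorizer(max_features=vocab_size)
    bow = vectorizer.fit_transform(tag_documents)  # sparse count matrix
    return bow.toarray(), vectorizer

if __name__ == "__main__":
    docs = ["sky beach sunset", "dog grass park dog"]  # placeholder tag lists
    features, vec = build_bow_features(docs, vocab_size=1000)
    # Number of columns equals the vocabulary actually found, capped at 1000.
    print(features.shape)

For a real dataset the vocabulary would be fit on the full tag corpus, so the feature dimension would reach the cap (1000 or 1386) rather than the handful of tokens in this toy example.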
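And a minimal PyTorch sketch of the text-modality network and hyperparameters quoted in the Experiment Setup row. This is an assumed reconstruction, not the authors' released code: the image-modality CNN (the first six CNN-F layers plus two fully-connected layers) and the ranking loss weighted by λ are not implemented here, and the class name TextNet is made up for illustration.

# Sketch only: two fully-connected layers full1 (4096 neurons, ReLU) and
# full2 (c neurons, identity), with the quoted hyperparameters listed as
# constants for reference (lambda and the iteration count are not used below).
import torch
import torch.nn as nn

BATCH_SIZE = 128      # mini-batch size for gradient descent
DROPOUT = 0.5         # dropout rate on the fully-connected layers
LAMBDA = 1.0          # regularization parameter in Eq. (4)
NUM_ITERATIONS = 500  # iterations for optimizing Eq. (4)

class TextNet(nn.Module):
    def __init__(self, bow_dim, code_length):
        super().__init__()
        self.full1 = nn.Linear(bow_dim, 4096)       # full1: 4096 neurons
        self.dropout = nn.Dropout(DROPOUT)          # dropout on FC layer
        self.full2 = nn.Linear(4096, code_length)   # full2: c neurons

    def forward(self, x):
        h = torch.relu(self.full1(x))   # ReLU after full1
        h = self.dropout(h)
        return self.full2(h)            # identity activation after full2

if __name__ == "__main__":
    net = TextNet(bow_dim=1386, code_length=32)  # example sizes (assumed)
    dummy = torch.randn(BATCH_SIZE, 1386)        # one mini-batch of BOW vectors
    codes = net(dummy)
    print(codes.shape)  # torch.Size([128, 32])

The continuous outputs of full2 would still need to be quantized (e.g., by sign) to obtain binary hash codes, and trained jointly with the image network under the paper's ranking-based objective.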