Cross-Modal Learning with Adversarial Samples

Authors: Chao Li, Shangqian Gao, Cheng Deng, De Xie, Wei Liu

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on two cross-modal benchmark datasets show that the adversarial examples produced by our CMLA are efficient in fooling a target deep cross-modal hashing network.
Researcher Affiliation | Collaboration | Chao Li (1,2), Cheng Deng (1), Shangqian Gao (2), De Xie (1), Wei Liu (3); (1) School of Electronic Engineering, Xidian University, Xi'an, Shaanxi, China; (2) Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, USA; (3) Tencent AI Lab, China
Pseudocode | Yes | Algorithm 1: Cross-Modal correlation Learning with Adversarial samples (CMLA).
Open Source Code | No | The paper states that the source code of the DCMH and SSAH baselines was provided by their authors, but it gives no explicit statement or link for the code of the proposed CMLA method.
Open Datasets | Yes | Extensive experiments are conducted on two benchmarks: MIRFlickr-25K [22] and NUS-WIDE [10].
Dataset Splits | Yes | For MIRFlickr-25K, 2,000 data points are randomly selected as a query set, 10,000 data points are used as a training set to train the target retrieval network model, and the remainder is kept as a retrieval database; 5,000 data points from the training set are further sampled to learn adversarial samples. For NUS-WIDE, 2,100 data points are randomly sampled as a query set and 10,500 data points as a training set. (See the split sketch after this table.)
Hardware Specification | Yes | Our proposed CMLA is implemented via TensorFlow [1] and is run on a server with two NVIDIA Tesla P40 GPUs, each with 24 GB of graphics memory.
Software Dependencies | Yes | Our proposed CMLA is implemented via TensorFlow [1].
Experiment Setup | Yes | All images are resized to 224 × 224 × 3 before being used as inputs. In adversarial sample learning, the Adam optimizer is used with initial learning rates of 0.5 and 0.002 for the image and text modalities, respectively, and each sample is trained for Tmax iterations. All hyper-parameters α, β, λ, ξ, γ, and η are set to 1 empirically. The mini-batch size is fixed at 128. ϵv is set to 8 for the image modality, and ϵt is set to 0.01 for the text modality. (See the setup sketch after this table.)
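
For concreteness, here is a minimal sketch of the dataset partitioning quoted in the Dataset Splits row, using the MIRFlickr-25K sizes. The function name, random seed, total sample count, and return structure are illustrative assumptions, not taken from the paper's (unreleased) code.

```python
import numpy as np

def split_indices(num_samples, query_size, train_size, adv_size, seed=0):
    """Randomly partition dataset indices into query / training / retrieval sets,
    and sub-sample part of the training set for adversarial sample learning.
    Sizes follow the MIRFlickr-25K protocol quoted above; everything else
    (name, seed, total count) is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_samples)

    query = perm[:query_size]                          # 2,000 query points
    train = perm[query_size:query_size + train_size]   # 10,000 training points
    retrieval = perm[query_size + train_size:]         # remainder: retrieval database
    adv = rng.choice(train, size=adv_size, replace=False)  # 5,000 points for adversarial learning

    return {"query": query, "train": train, "retrieval": retrieval, "adversarial": adv}

# MIRFlickr-25K protocol: 2,000 query / 10,000 training / rest retrieval, 5,000 adversarial
splits = split_indices(num_samples=25_000, query_size=2_000, train_size=10_000, adv_size=5_000)
```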
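
The Experiment Setup row can likewise be read as the following TensorFlow sketch of per-sample adversarial learning. The loss function `cmla_loss_fn`, the variable names, the placeholder value of Tmax, and the choice to enforce the ϵ bounds by clipping the perturbations to [-ϵ, ϵ] are assumptions for illustration; only the learning rates, batch size, and ϵ values come from the quoted setup.

```python
import tensorflow as tf

# Hyper-parameters quoted in the Experiment Setup row; T_MAX is the paper's
# unspecified per-sample iteration budget (the value here is a placeholder).
BATCH_SIZE = 128
LR_IMAGE, LR_TEXT = 0.5, 0.002
EPS_IMAGE, EPS_TEXT = 8.0, 0.01
T_MAX = 100  # placeholder

opt_image = tf.keras.optimizers.Adam(learning_rate=LR_IMAGE)
opt_text = tf.keras.optimizers.Adam(learning_rate=LR_TEXT)

def learn_adversarial_samples(images, texts, cmla_loss_fn):
    """Iteratively learn additive perturbations for one mini-batch.
    `cmla_loss_fn` stands in for the CMLA objective defined in the paper;
    projecting the perturbations into [-eps, eps] is an assumption about
    how the per-modality epsilon bounds are enforced."""
    delta_v = tf.Variable(tf.zeros_like(images))  # image-modality perturbation
    delta_t = tf.Variable(tf.zeros_like(texts))   # text-modality perturbation

    for _ in range(T_MAX):
        with tf.GradientTape() as tape:
            loss = cmla_loss_fn(images + delta_v, texts + delta_t)
        grads = tape.gradient(loss, [delta_v, delta_t])
        opt_image.apply_gradients([(grads[0], delta_v)])
        opt_text.apply_gradients([(grads[1], delta_t)])
        # Keep each perturbation within its modality's epsilon bound.
        delta_v.assign(tf.clip_by_value(delta_v, -EPS_IMAGE, EPS_IMAGE))
        delta_t.assign(tf.clip_by_value(delta_t, -EPS_TEXT, EPS_TEXT))

    return images + delta_v, texts + delta_t
```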