Semi-Supervised Multi-Modal Learning with Balanced Spectral Decomposition

Authors: Peng Hu, Hongyuan Zhu, Xi Peng, Jie Lin

AAAI 2020

Reproducibility variables, each with the assessed result and the supporting evidence extracted from the paper:
Research Type: Experimental. "To verify the effectiveness of the proposed method, extensive experiments are carried out on three widely-used multimodal datasets, comparing with 13 state-of-the-art approaches."
Researcher Affiliation: Collaboration. "(1) Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore; (2) College of Computer Science, Sichuan University, Chengdu 610065, China"
Pseudocode: Yes. "Algorithm 1: Optimization procedure of SMLN"
Open Source Code: No. The paper does not provide a link or any other concrete access to source code for the described method.
Open Datasets: Yes. "Three multimodal datasets are adopted in our experiments, including the Wikipedia dataset (Rasiwasia et al. 2010), the NUS-WIDE dataset (Chua et al. 2009), and the XMediaNet dataset (Peng, Qi, and Yuan 2018; Peng, Huang, and Zhao 2017)."
Dataset Splits: Yes. "The statistics of the three datasets are summarized in Table 1. We randomly selected 5%, 10%, and 30% of the samples from the training set as labeled data, and the remaining samples as unlabeled data. Therefore, there are three groups for each dataset, as shown in our experimental results. The image features in our experiments are extracted from the fc7 layer of a 19-layer VGGNet (Krizhevsky, Sutskever, and Hinton 2012) with a dimension of 4,096. The text representation is extracted by a Doc2Vec model (Lau and Baldwin 2016) pre-trained on Wikipedia with a dimension of 300."

Dataset     Labels  Modality  Instances (train/val/test)  Feature
Wikipedia   10      Image     2,173/231/462               4,096D VGG
Wikipedia   10      Text      2,173/231/462               300D Doc2Vec
NUS-WIDE    10      Image     42,941/5,000/23,661         4,096D VGG
NUS-WIDE    10      Text      42,941/5,000/23,661         300D Doc2Vec
XMediaNet   200     Image     32,000/4,000/4,000          4,096D VGG
XMediaNet   200     Text      32,000/4,000/4,000          300D Doc2Vec

Table 1: General statistics of the three datasets used in the experiments; the Instances column gives the sizes of the training/validation/test subsets.
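To make the split protocol concrete, here is a minimal sketch of how such a labeled/unlabeled partition could be reproduced. This is not code from the paper; the random seed and variable names are assumptions, and the ratios 0.05/0.10/0.30 correspond to the 5%/10%/30% groups described above.

```python
import numpy as np

def split_labeled_unlabeled(num_train, labeled_ratio, seed=0):
    """Randomly mark a fraction of the training indices as labeled
    and return (labeled_indices, unlabeled_indices)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_train)
    num_labeled = int(labeled_ratio * num_train)
    return perm[:num_labeled], perm[num_labeled:]

# Example: the Wikipedia training set has 2,173 instances; keep 5% labeled.
labeled_idx, unlabeled_idx = split_labeled_unlabeled(2173, 0.05)
```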
Hardware Specification: Yes. "The proposed model is trained on two Nvidia GTX 2080Ti GPUs in PyTorch."
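For reference, two-GPU training in PyTorch is typically set up along the following lines. This is a generic sketch, not the paper's code, and the network definition is a toy placeholder.

```python
import torch
import torch.nn as nn

# Toy placeholder standing in for the paper's model (assumption).
model = nn.Sequential(nn.Linear(4096, 1024), nn.ReLU(), nn.Linear(1024, 200))

# Replicate the model across both GPUs so each mini-batch is split
# between them, matching the reported two-GPU setup.
if torch.cuda.device_count() >= 2:
    model = nn.DataParallel(model, device_ids=[0, 1]).cuda()
```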
Software Dependencies: No. The paper mentions PyTorch but does not specify a version number.
Experiment Setup: Yes. "The batch size N_b is set to 128 for the Wikipedia and NUS-WIDE datasets, and 512 for the XMediaNet dataset. The number of nearest neighbors is set to 2, 3, and 3 for Wikipedia, NUS-WIDE, and XMediaNet, respectively. The dimensionality of the common space is set to c (the number of categories) for all datasets. The learning rate α is set to 10^-4 in all experiments. For training, we employ the Adam optimizer (Kingma and Ba 2014) for a maximum of 200 epochs."
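The reported hyperparameters translate into a training loop of roughly the following shape. This is a minimal sketch using the stated settings (batch size 128, Adam at 1e-4, 200 epochs); the two linear projection networks, the dummy data, and the classification-style loss are assumptions standing in for the paper's actual model and objective.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Reported hyperparameters: batch size 128 (Wikipedia/NUS-WIDE; 512 for
# XMediaNet), learning rate 1e-4, Adam, a maximum of 200 epochs.
BATCH_SIZE, LR, MAX_EPOCHS = 128, 1e-4, 200

# Toy stand-ins (assumptions) projecting 4,096-D VGG image features and
# 300-D Doc2Vec text features into a common space of dimension c.
c = 10  # number of categories, e.g. for Wikipedia
img_net, txt_net = nn.Linear(4096, c), nn.Linear(300, c)

params = list(img_net.parameters()) + list(txt_net.parameters())
optimizer = torch.optim.Adam(params, lr=LR)

# Dummy tensors in place of the real extracted features, so the loop runs.
imgs, txts = torch.randn(2173, 4096), torch.randn(2173, 300)
labels = torch.randint(0, c, (2173,))
loader = DataLoader(TensorDataset(imgs, txts, labels),
                    batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(MAX_EPOCHS):
    for img_batch, txt_batch, y in loader:
        optimizer.zero_grad()
        # Placeholder objective (NOT the paper's loss): pull both
        # modalities toward a shared classification target.
        loss = (nn.functional.cross_entropy(img_net(img_batch), y)
                + nn.functional.cross_entropy(txt_net(txt_batch), y))
        loss.backward()
        optimizer.step()
```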