Disjoint Mapping Network for Cross-modal Matching of Voices and Faces

Authors: Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, Rita Singh

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically that DIMNet is able to achieve better performance than the current state-of-the-art methods, with the additional benefits of being conceptually simpler and less data-intensive. The code is made available at https://github.com/ydwen/DIMNet. Our experiments were conducted on the VoxCeleb (Nagrani et al., 2017) and VGGFace (Parkhi et al., 2015) datasets, which are specified in Appendix A.1. We ran experiments on matching voices to faces, to evaluate the embeddings derived by DIMNets. (A minimal sketch of such a matching test appears after the table.)
Researcher Affiliation | Academia | Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, Rita Singh; Carnegie Mellon University; Georgia Institute of Technology; yandongw@andrew.cmu.edu, mahmoudi@andrew.cmu.edu, wyliu@gatech.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is made available at https://github.com/ydwen/DIMNet.
Open Datasets | Yes | Our experiments were conducted on the VoxCeleb (Nagrani et al., 2017) and VGGFace (Parkhi et al., 2015) datasets, which are specified in Appendix A.1. We use the intersection of the two datasets... The data are split into train/validation/test sets, following the settings in Nagrani et al. (2018b). Details can be found in Appendix A.1.
Dataset Splits | Yes | The data are split into train/validation/test sets, following the settings in Nagrani et al. (2018b). Details can be found in Appendix A.1. Table 6: Statistics for the data appearing in VoxCeleb and VGGFace.
    # of samples    | train   | validation | test   | total
    speech segments | 112,697 | 14,160     | 21,799 | 148,656
    face images     | 313,593 | 36,716     | 58,420 | 408,729
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running experiments.
Software Dependencies | No | The paper mentions tools such as an energy-based voice activity detector (Povey et al., 2011) and MTCNN (Zhang et al., 2016), but does not provide version numbers for its software dependencies or libraries. (A sketch of MTCNN-based face cropping appears after the table.)
Experiment Setup | Yes | The detailed network configurations are elaborated in Appendix A.3. ... Minibatch size is 256. The momentum and weight decay values are 0.9 and 0.001, respectively. To learn the networks from scratch, the learning rate is initialized at 0.1 and divided by 10 after 16K iterations and again after 24K iterations. The training is completed at 28K iterations. (A sketch of this schedule appears below.)
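
The voice-to-face matching experiment quoted in the Research Type row can be illustrated with a short forced-choice sketch. This is a minimal illustration assuming cosine similarity between fixed-length embeddings, which is one common choice for this task; the random vectors below are placeholders, not outputs of the authors' released model:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def match_voice_to_faces(voice, face_a, face_b):
    """Forced-choice 1:2 test: return 0 if face_a matches the voice
    better than face_b, else 1."""
    return 0 if cosine(voice, face_a) >= cosine(voice, face_b) else 1

# Random 128-d vectors stand in for DIMNet voice/face embeddings.
rng = np.random.default_rng(0)
voice, face_a, face_b = rng.normal(size=(3, 128))
print(match_voice_to_faces(voice, face_a, face_b))
```

Test accuracy on this task is simply the fraction of trials where the same-identity candidate wins the comparison.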
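The Software Dependencies row notes that the paper uses MTCNN (Zhang et al., 2016) for face preprocessing without naming an implementation or version. A minimal sketch using the facenet-pytorch port of MTCNN, which is an assumption of this note and not necessarily the authors' pipeline:

```python
from PIL import Image
from facenet_pytorch import MTCNN  # assumed port; the paper only cites MTCNN (Zhang et al., 2016)

# Detector that returns an aligned face crop as a tensor.
mtcnn = MTCNN(image_size=160, margin=0)

img = Image.new("RGB", (250, 250))  # blank placeholder; load a real photo in practice
face = mtcnn(img)                   # cropped face tensor, or None if no face is found
print("no face detected" if face is None else f"face crop shape: {tuple(face.shape)}")
```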
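The hyperparameters quoted in the Experiment Setup row map onto a standard SGD schedule. A minimal PyTorch sketch under that reading; the linear model and random minibatches are placeholders for the DIMNet architecture (given in the paper's Appendix A.3 and released code) and real voice/face features:

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Toy stand-in for DIMNet; the real architecture is in Appendix A.3.
model = torch.nn.Linear(128, 10)
criterion = torch.nn.CrossEntropyLoss()

# Hyperparameters quoted from the paper: lr 0.1, momentum 0.9,
# weight decay 0.001, minibatch size 256.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.001)
# Divide the learning rate by 10 after 16K and again after 24K iterations.
scheduler = MultiStepLR(optimizer, milestones=[16_000, 24_000], gamma=0.1)

for iteration in range(28_000):       # training completes at 28K iterations
    x = torch.randn(256, 128)         # placeholder minibatch of size 256
    y = torch.randint(0, 10, (256,))  # placeholder identity labels
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```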