Disjoint Mapping Network for Cross-modal Matching of Voices and Faces
Authors: Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, Rita Singh
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that DIMNet is able to achieve better performance than the current state-of-the-art methods, with the additional benefits of being conceptually simpler and less data-intensive. The code is made available at https://github.com/ydwen/DIMNet. Our experiments were conducted on the Voxceleb (Nagrani et al., 2017) and VGGFace (Parkhi et al., 2015) datasets, which are specified in appendix A.1. We ran experiments on matching voices to faces, to evaluate the embeddings derived by DIMNets. |
| Researcher Affiliation | Academia | Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, Rita Singh (Carnegie Mellon University; Georgia Institute of Technology). yandongw@andrew.cmu.edu, mahmoudi@andrew.cmu.edu, wyliu@gatech.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is made available at https://github.com/ydwen/DIMNet. |
| Open Datasets | Yes | Our experiments were conducted on the Voxceleb (Nagrani et al., 2017) and VGGFace (Parkhi et al., 2015) datasets, which are specified in appendix A.1. We use the intersection of the two datasets... The data are split into train/validation/test sets, following the settings in Nagrani et al. (2018b). Details can be found in Appendix A.1. |
| Dataset Splits | Yes | The data are split into train/validation/test sets, following the settings in Nagrani et al. (2018b). Details can be found in Appendix A.1. ... Table 6: Statistics for the data appearing in VoxCeleb and VGGFace: speech segments 112,697 (train) / 14,160 (validation) / 21,799 (test), 148,656 total; face images 313,593 (train) / 36,716 (validation) / 58,420 (test), 408,729 total. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running experiments. |
| Software Dependencies | No | The paper mentions tools like 'energy-based voice activity detector (Povey et al., 2011)' and 'MTCNN (Zhang et al., 2016)' but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | The detailed network configurations are elaborated in appendix A.3. ... Minibatch size is 256. The momentum and weight decay values are 0.9 and 0.001 respectively. To learn the networks from scratch, the learning rate is initialized at 0.1 and divided by 10 after 16K iterations and again after 24K iterations. The training is completed at 28K iterations. |
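The optimization schedule quoted above (initial learning rate 0.1, divided by 10 after 16K and again after 24K iterations, training stopped at 28K) can be sketched as a small step-decay helper. This is a minimal illustration of the stated numbers, not the authors' code; the function name and its interface are hypothetical.

```python
def learning_rate(iteration,
                  base_lr=0.1,          # initial LR stated in the paper
                  decay_steps=(16_000, 24_000),  # drop points stated in the paper
                  gamma=0.1):           # divide-by-10 factor
    """Step learning-rate schedule matching the paper's description.

    Hypothetical helper: returns the LR in effect at a given iteration,
    assuming each drop applies once the iteration count passes a decay step.
    """
    lr = base_lr
    for step in decay_steps:
        if iteration >= step:
            lr *= gamma
    return lr

# Schedule over the 28K-iteration run described in the paper:
# iterations [0, 16K) -> 0.1, [16K, 24K) -> 0.01, [24K, 28K] -> 0.001
```

The remaining reported hyperparameters (minibatch size 256, momentum 0.9, weight decay 0.001) would be passed to the optimizer itself; the paper does not state which framework was used.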