What Does Your Face Sound Like? 3D Face Shape towards Voice

Authors: Zhihan Yang, Zhiyong Wu, Ying Shan, Jia Jia

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments and subjective tests demonstrate our method can generate utterances matching faces well, with good audio quality and voice diversity. We also explore and visualize how the voice changes with the face. Case studies show that our method upgrades the face-voice inference to personalized custom-made voice creating, revealing a promising prospect in virtual human and dubbing applications.
Researcher Affiliation | Collaboration | Zhihan Yang1, Zhiyong Wu1, Ying Shan2, Jia Jia3* (1 Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China; 2 Applied Research Center (ARC), Tencent PCG, Shenzhen 518054, China; 3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Pseudocode | No | The paper describes the proposed framework and methodology in narrative text and diagrams (Figure 1) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper uses and links to third-party open-source tools (Kaldi, Face-detection-feature-extraction, ESPnet) but does not state that the code for its own proposed framework is open source, nor does it provide a link to such code.
Open Datasets | Yes | The first part of the dataset comes from VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) and VGGFace2 (Cao et al. 2018a). Additionally, we utilize ChaLearn LAP (Ponce-López et al. 2016) video dataset, containing videos of more than 30,000 clips.
Dataset Splits | Yes | We split the first 1,200 and 525 speakers off VoxCeleb2 and ChaLearn LAP for validation, remaining 4,795 and 2,009 speakers for training respectively. (A sketch of this split appears below the table.)
Hardware Specification | Yes | We train our model on an NVIDIA GeForce 2080 Ti for 50 epochs, with a batch size of 64.
Software Dependencies | No | The paper mentions several software components, such as 'Kaldi', the 'VGG-19 model', and 'Conformer-FastSpeech 2', but it does not provide specific version numbers for these dependencies or for the underlying frameworks used (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We train our model on an NVIDIA GeForce 2080 Ti for 50 epochs, with a batch size of 64. We adopt the Adam optimizer with a learning rate of 0.002. (This configuration is sketched below the table.)
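
The speaker-level split reported in the Dataset Splits row is simple enough to restate as code. The sketch below is a hypothetical reconstruction, not the authors' code (none is released); the speaker ID formats and their ordering are assumptions, and only the split counts (1,200 / 4,795 for VoxCeleb2 and 525 / 2,009 for ChaLearn LAP) come from the paper.

```python
# Hypothetical reconstruction of the speaker-level split described in the
# "Dataset Splits" row. ID formats and ordering are assumptions.

def split_speakers(speaker_ids, num_val):
    """Take the first num_val speakers for validation, the rest for training."""
    return speaker_ids[:num_val], speaker_ids[num_val:]

# Placeholder speaker ID lists; in practice these would come from the
# VoxCeleb2 and ChaLearn LAP metadata.
voxceleb2_speakers = [f"id{n:05d}" for n in range(5995)]   # 1,200 + 4,795
chalearn_speakers = [f"spk{n:04d}" for n in range(2534)]   # 525 + 2,009

vox_val, vox_train = split_speakers(voxceleb2_speakers, 1200)
chalearn_val, chalearn_train = split_speakers(chalearn_speakers, 525)

print(len(vox_val), len(vox_train))          # 1200 4795
print(len(chalearn_val), len(chalearn_train))  # 525 2009
```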
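
Similarly, the training settings quoted in the Hardware Specification and Experiment Setup rows (Adam optimizer, learning rate 0.002, batch size 64, 50 epochs) can be illustrated with a minimal PyTorch-style loop. Everything except those hyperparameters is an assumption: the placeholder model, the random tensors standing in for face-shape features and speaker embeddings, and the MSE loss are hypothetical stand-ins, since the paper's actual architecture and training objective are not available as code.

```python
# Minimal sketch of the reported training configuration; only the
# hyperparameters (Adam, lr 0.002, batch size 64, 50 epochs) come from
# the paper. Model, data, and loss are hypothetical placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Random stand-ins for face-shape features and target speaker embeddings
# (feature dimensions are assumptions, not from the paper).
faces = torch.randn(1024, 257)
targets = torch.randn(1024, 256)
loader = DataLoader(TensorDataset(faces, targets), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 256))
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
criterion = nn.MSELoss()

for epoch in range(50):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```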