What Does Your Face Sound Like? 3D Face Shape towards Voice
Authors: Zhihan Yang, Zhiyong Wu, Ying Shan, Jia Jia
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments and subjective tests demonstrate our method can generate utterances matching faces well, with good audio quality and voice diversity. We also explore and visualize how the voice changes with the face. Case studies show that our method upgrades the face-voice inference to personalized custom-made voice creating, revealing a promising prospect in virtual human and dubbing applications. |
| Researcher Affiliation | Collaboration | Zhihan Yang1, Zhiyong Wu1, Ying Shan2, Jia Jia3* — 1Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China; 2Applied Research Center (ARC), Tencent PCG, Shenzhen 518054, China; 3Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China |
| Pseudocode | No | The paper describes the proposed framework and methodology in narrative text and diagrams (Figure 1) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using and links to third-party open-source tools (Kaldi, Face-detection-feature-extraction, ESPnet) but does not state that the code for their specific proposed framework or methodology is open-source or provide a link for it. |
| Open Datasets | Yes | The first part of the dataset comes from VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) and VGGFace2 (Cao et al. 2018a). Additionally, we utilize the ChaLearn LAP (Ponce-López et al. 2016) video dataset, containing more than 30,000 video clips. |
| Dataset Splits | Yes | We split off the first 1,200 and 525 speakers of VoxCeleb2 and ChaLearn LAP for validation, leaving the remaining 4,795 and 2,009 speakers, respectively, for training. |
| Hardware Specification | Yes | We train our model on an NVIDIA GeForce 2080 Ti for 50 epochs, with a batch size of 64. |
| Software Dependencies | No | The paper mentions several software components, such as 'Kaldi', the 'VGG-19 model', and 'Conformer-FastSpeech2', but it does not provide specific version numbers for these dependencies or for the underlying frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We train our model on an NVIDIA GeForce 2080 Ti for 50 epochs, with a batch size of 64. We adopt the Adam optimizer with a learning rate of 0.002. |
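The reported setup (Adam, learning rate 0.002, batch size 64, 50 epochs) can be sketched as a minimal PyTorch training loop. This is an illustrative stand-in only: the model, feature dimensions, and dummy data below are hypothetical placeholders, not the paper's actual face-to-voice architecture.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters as reported in the paper's experiment setup.
EPOCHS, BATCH_SIZE, LEARNING_RATE = 50, 64, 0.002

# Placeholder regression model standing in for the actual network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
criterion = nn.MSELoss()

# Dummy dataset: 256 random (input-feature, target-embedding) pairs.
data = TensorDataset(torch.randn(256, 16), torch.randn(256, 8))
loader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True)

for epoch in range(EPOCHS):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```

Only the optimizer choice and the three hyperparameter values come from the paper; everything else is filler needed to make the loop runnable.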