Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation
Authors: Yasheng Sun, Hang Zhou, Ziwei Liu, Hideki Koike
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our approach encourages better speech-identity correlation learning while generating vivid faces whose identities are consistent with given speech samples. |
| Researcher Affiliation | Collaboration | Yasheng Sun¹, Hang Zhou², Ziwei Liu³ and Hideki Koike¹; ¹Tokyo Institute of Technology, ²CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, ³S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper describes its methods using text and mathematical equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please refer to https://hangz-nju-cuhk.github.io/projects/S2TF for our video, code and models. |
| Open Datasets | Yes | We use the popular in-the-wild dataset VoxCeleb2 [Chung et al., 2018] in our experiments. |
| Dataset Splits | Yes | It contains a total of 6,112 celebrities: 5,994 speakers are selected for training and 118 speakers are selected for testing. |
| Hardware Specification | Yes | We conduct our experiments using the PyTorch deep learning framework with eight 16 GB Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'PyTorch deep learning framework' but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | Yes | The lengths of the identity features f^a_id and f^v_id are set to 2048, while the identity-irrelevant features f^v_ir and f^a_s are set to 512. We conduct our experiments using the PyTorch deep learning framework with eight 16 GB Tesla V100 GPUs. Images are cropped to 224 × 224. The audio inputs are mel-spectrograms processed with FFT window size 1280 and hop length 160 with 80 Mel filter-banks. A clip of human voice lasting 3.2 seconds is used for our speech-to-identity mapping. During testing, we retrieve an arbitrary image from another identity as the pose source for our generated identity. All λs in the loss functions are empirically set to 1. |
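
The reported audio preprocessing (FFT window size 1280, hop length 160, 80 Mel filter-banks, 3.2-second clips) is concrete enough to sketch. The following is a minimal, hypothetical reproduction of that front end using torchaudio; it is not the authors' released code, and the 16 kHz sample rate is an assumption based on common VoxCeleb2 practice.

```python
# Hypothetical sketch of the mel-spectrogram front end described in the
# Experiment Setup row. Parameter names follow torchaudio, not the paper's code.
import torch
import torchaudio

SAMPLE_RATE = 16_000   # assumed; VoxCeleb2 audio is commonly used at 16 kHz
CLIP_SECONDS = 3.2     # clip length used for speech-to-identity mapping
N_FFT = 1280           # FFT window size reported in the paper
HOP_LENGTH = 160       # hop length reported in the paper
N_MELS = 80            # number of Mel filter-banks reported in the paper

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
)

def speech_clip_to_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a mono waveform of shape (1, num_samples) to a mel-spectrogram.

    A 3.2 s clip at 16 kHz gives 51,200 samples, i.e. about
    51,200 / 160 ≈ 320 frames of 80 mel bins.
    """
    num_samples = int(CLIP_SECONDS * SAMPLE_RATE)
    clip = waveform[:, :num_samples]          # truncate to a 3.2 s window
    return mel_transform(clip)                # shape: (1, 80, ~321)

# Usage with random audio standing in for a real utterance.
dummy_audio = torch.randn(1, int(CLIP_SECONDS * SAMPLE_RATE))
mel = speech_clip_to_mel(dummy_audio)
print(mel.shape)  # torch.Size([1, 80, 321])
```

Under these assumptions, the spectrogram's time axis (roughly 320 frames for 3.2 s) would be the input to the speech-to-identity mapping, with frame count scaling linearly in clip length.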