Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation

Authors: Yasheng Sun, Hang Zhou, Ziwei Liu, Hideki Koike

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our approach encourages better speech-identity correlation learning while generating vivid faces whose identities are consistent with given speech samples.
Researcher Affiliation | Collaboration | Yasheng Sun (1), Hang Zhou (2), Ziwei Liu (3) and Hideki Koike (1); (1) Tokyo Institute of Technology, (2) CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, (3) S-Lab, Nanyang Technological University
Pseudocode | No | The paper describes its methods using text and mathematical equations but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Please refer to https://hangz-nju-cuhk.github.io/projects/S2TF for our video, code and models.
Open Datasets | Yes | We use the popular in-the-wild dataset VoxCeleb2 [Chung et al., 2018] in our experiments.
Dataset Splits | Yes | It contains a total of 6,112 celebrities. 5,994 speakers are selected for training and 118 speakers are selected for testing.
Hardware Specification | Yes | We conduct our experiments using the PyTorch deep learning framework with eight 16 GB Tesla V100 GPUs.
Software Dependencies | No | The paper mentions using the 'PyTorch deep learning framework' but does not specify a version number or other software dependencies with their versions.
Experiment Setup | Yes | The lengths of the identity features f^a_id and f^v_id are set to 2048, while the identity-irrelevant features f^v_ir and f^a_s are set to 512. We conduct our experiments using the PyTorch deep learning framework with eight 16 GB Tesla V100 GPUs. Images are cropped to 224 × 224. The audio inputs are mel-spectrograms processed with FFT window size 1280 and hop length 160, with 80 Mel filter-banks. A clip of human voice lasting 3.2 seconds is used for our speech-to-identity mapping. During testing, we retrieve an arbitrary image from other identities as the pose source for our generated identity. All λs in the loss functions are empirically set to 1.
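For readers attempting reproduction, the audio parameters quoted above translate directly into standard tooling. Below is a minimal preprocessing sketch using torchaudio. It assumes 16 kHz mono input (the VoxCeleb2 standard; the quoted setup does not state a sample rate), and the function name `wav_to_mel` and its constants are illustrative, not taken from the authors' released code.

```python
# Minimal sketch of the mel-spectrogram preprocessing described in the
# Experiment Setup row. Assumption: 16 kHz mono audio (VoxCeleb2 default);
# the paper's quoted text only specifies n_fft, hop_length, and n_mels.
import torch
import torchaudio

SAMPLE_RATE = 16000   # assumed, not stated in the quoted setup
CLIP_SECONDS = 3.2    # clip length used for speech-to-identity mapping

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1280,       # FFT window size stated in the paper
    hop_length=160,   # hop length stated in the paper
    n_mels=80,        # 80 Mel filter-banks stated in the paper
)

def wav_to_mel(path: str) -> torch.Tensor:
    """Load a waveform and return the mel-spectrogram of a 3.2 s clip."""
    wav, sr = torchaudio.load(path)
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    wav = wav.mean(dim=0, keepdim=True)            # force mono
    num_samples = int(CLIP_SECONDS * SAMPLE_RATE)  # 51,200 samples
    wav = wav[:, :num_samples]                     # take the first 3.2 s
    return mel_transform(wav)                      # shape: (1, 80, ~321)

# Usage: mel = wav_to_mel("speaker_clip.wav"); print(mel.shape)
```

As a sanity check on the stated parameters: a 3.2 s clip at 16 kHz is 51,200 samples, so a hop length of 160 yields roughly 320 frames of 80 mel bins per speech-to-identity input.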