Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation
Authors: Yasheng Sun, Hang Zhou, Ziwei Liu, Hideki Koike
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our approach encourages better speech-identity correlation learning while generating vivid faces whose identities are consistent with given speech samples. |
| Researcher Affiliation | Collaboration | Yasheng Sun¹, Hang Zhou², Ziwei Liu³ and Hideki Koike¹; ¹Tokyo Institute of Technology, ²CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, ³S-Lab, Nanyang Technological University |
| Pseudocode | No | The paper describes its methods using text and mathematical equations but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please refer to https://hangz-nju-cuhk.github.io/projects/S2TF for our video, code and models. |
| Open Datasets | Yes | We use the popular in-the-wild dataset VoxCeleb2 [Chung et al., 2018] in our experiments. |
| Dataset Splits | Yes | It contains a total of 6,112 celebrities: 5,994 speakers are selected for training and 118 speakers are selected for testing. |
| Hardware Specification | Yes | We conduct our experiments using the PyTorch deep learning framework with eight 16 GB Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'PyTorch deep learning framework' but does not specify a version number or other software dependencies with their versions. |
| Experiment Setup | Yes | The lengths of the identity features f^a_id and f^v_id are set to 2048, while the identity-irrelevant features f^v_ir and f^a_s are set to 512. We conduct our experiments using the PyTorch deep learning framework with eight 16 GB Tesla V100 GPUs. Images are cropped to 224 × 224. The audio inputs are mel-spectrograms processed with FFT window size 1280 and hop length 160 with 80 Mel filter-banks. A clip of human voice lasting 3.2 seconds is used for our speech-to-identity mapping. During testing, we retrieve an arbitrary image from another identity as the pose source for our generated identity. All λs in the loss functions are empirically set to 1. |
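
The reported audio preprocessing (FFT window size 1280, hop length 160, 80 Mel filter-banks, 3.2-second clips) is concrete enough to sketch. The following is a minimal, hypothetical reproduction of that front end using torchaudio; it is not the authors' released code, and the 16 kHz sample rate is an assumption based on common VoxCeleb2 practice.

```python
# Hypothetical sketch of the mel-spectrogram front end described in the
# Experiment Setup row. Parameter names follow torchaudio, not the paper's code.
import torch
import torchaudio

SAMPLE_RATE = 16_000   # assumed; VoxCeleb2 audio is commonly used at 16 kHz
CLIP_SECONDS = 3.2     # clip length used for speech-to-identity mapping
N_FFT = 1280           # FFT window size reported in the paper
HOP_LENGTH = 160       # hop length reported in the paper
N_MELS = 80            # number of Mel filter-banks reported in the paper

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH,
    n_mels=N_MELS,
)

def speech_clip_to_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Convert a mono waveform of shape (1, num_samples) to a mel-spectrogram.

    A 3.2 s clip at 16 kHz gives 51,200 samples, i.e. about
    51,200 / 160 ≈ 320 frames of 80 mel bins.
    """
    num_samples = int(CLIP_SECONDS * SAMPLE_RATE)
    clip = waveform[:, :num_samples]          # truncate to a 3.2 s window
    return mel_transform(clip)                # shape: (1, 80, ~321)

# Usage with random audio standing in for a real utterance.
dummy_audio = torch.randn(1, int(CLIP_SECONDS * SAMPLE_RATE))
mel = speech_clip_to_mel(dummy_audio)
print(mel.shape)  # torch.Size([1, 80, 321])
```

Under these assumptions, the spectrogram's time axis (roughly 320 frames for 3.2 s) would be the input to the speech-to-identity mapping, with frame count scaling linearly in clip length.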