VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Authors: Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively.
Researcher Affiliation | Industry | Sicheng Xu (Microsoft Research Asia, sichengxu@microsoft.com); Guojun Chen (Microsoft Research Asia, guoch@microsoft.com); Yu-Xiao Guo (Microsoft Research Asia, yuxgu@microsoft.com); Jiaolong Yang (Microsoft Research Asia, jiaoyan@microsoft.com); Chong Li (Microsoft Research Asia, chol@microsoft.com); Zhenyu Zang (Microsoft Research Asia, zhenyuzang@microsoft.com); Yizhong Zhang (Microsoft Research Asia, yizzhan@microsoft.com); Xin Tong (Microsoft Research Asia, xtong@microsoft.com); Baining Guo (Microsoft Research Asia, bainguo@microsoft.com)
Pseudocode | No | The paper describes the methodology but does not contain any structured pseudocode or algorithm blocks labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | Based on the RAI considerations, we will not release our code or data in case of potential misuse, as discussed in Section A.
Open Datasets | Yes | For face latent space learning, we use the public VoxCeleb2 dataset from [14], which contains talking face videos from about 6K subjects.
Dataset Splits | No | The total data used for training comprises approximately 500K clips, each lasting between 2 to 10 seconds. The parameter counts of our 3D-aided face latent model and diffusion transformer model are about 200M and 29M respectively.
Hardware Specification | Yes | Our face latent model takes around 7 days of training on a workstation with 4 NVIDIA RTX A6000 GPUs, and the diffusion transformer takes around 3 days. ... evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.
Software Dependencies | No | The paper mentions using a pretrained feature extractor Wav2Vec2 [3] and SyncNet [15] for evaluation, but does not specify software dependencies with version numbers for its implementation.
Experiment Setup | Yes | For motion latent generation, we use an 8-layer transformer encoder with an embedding dim of 512 and 8 heads as our diffusion network. The model is trained on VoxCeleb2 [14] and another high-resolution talk video dataset collected by us, which contains about 3.5K subjects. In our default setup, the model uses a forward-facing main gaze condition, an average head distance of all training videos, and an empty emotion offset condition. The CFG parameters are set to λA = 0.5 and λg = 1.0, and 50 sampling steps are used. (A minimal configuration sketch follows the table.)
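The Experiment Setup row maps onto a fairly standard conditional diffusion setup. Below is a minimal, hypothetical PyTorch sketch of an 8-layer transformer-encoder denoiser with embedding dim 512 and 8 attention heads, combined with two-scale classifier-free guidance using λA = 0.5 and λg = 1.0. Since the authors do not release code, all class, argument, and helper names here (MotionDiffusionTransformer, cfg_denoise, latent_dim, audio_dim, the condition-token packing) are illustrative assumptions; the exact way VASA-1 injects conditions and composes multi-condition CFG may differ.

```python
# Minimal sketch of an audio-conditioned motion-latent diffusion transformer,
# assuming the configuration quoted in the Experiment Setup row above.
# Not the authors' implementation; names and dimensions other than the
# 8-layer / 512-dim / 8-head encoder are placeholder assumptions.
import torch
import torch.nn as nn


class MotionDiffusionTransformer(nn.Module):
    def __init__(self, latent_dim=256, audio_dim=768, embed_dim=512,
                 num_layers=8, num_heads=8):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)  # e.g. Wav2Vec2 features
        self.time_embed = nn.Sequential(
            nn.Linear(1, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(embed_dim, latent_dim)

    def forward(self, noisy_latents, t, audio_feats, cond_embed):
        # noisy_latents: (B, T, latent_dim); audio_feats: (B, T, audio_dim)
        # cond_embed: (B, 1, embed_dim) packing gaze / distance / emotion signals
        x = self.latent_proj(noisy_latents) + self.audio_proj(audio_feats)
        x = x + self.time_embed(t.view(-1, 1, 1).float())
        x = torch.cat([cond_embed, x], dim=1)  # prepend a condition token
        x = self.encoder(x)
        return self.out(x[:, 1:])              # drop the condition token


def cfg_denoise(model, z_t, t, audio, cond, null_audio, null_cond,
                lambda_a=0.5, lambda_g=1.0):
    """Two-scale classifier-free guidance over the audio and other conditions.

    Follows the common per-condition CFG pattern; the paper's exact
    composition may differ.
    """
    full = model(z_t, t, audio, cond)
    no_audio = model(z_t, t, null_audio, cond)
    no_cond = model(z_t, t, audio, null_cond)
    return full + lambda_a * (full - no_audio) + lambda_g * (full - no_cond)


# Example shapes: a batch of 2 clips with 25 motion frames each.
model = MotionDiffusionTransformer()
z_t = torch.randn(2, 25, 256)
t = torch.randint(0, 1000, (2,))
audio = torch.randn(2, 25, 768)
cond = torch.randn(2, 1, 512)
eps = cfg_denoise(model, z_t, t, audio, cond,
                  null_audio=torch.zeros_like(audio),
                  null_cond=torch.zeros_like(cond))
print(eps.shape)  # torch.Size([2, 25, 256])
```

At inference, a guided prediction like this would be evaluated at each of the 50 sampling steps of a standard diffusion sampler; the sampler loop itself is omitted from the sketch.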