VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
Authors: Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. |
| Researcher Affiliation | Industry | Sicheng Xu, Microsoft Research Asia, sichengxu@microsoft.com; Guojun Chen, Microsoft Research Asia, guoch@microsoft.com; Yu-Xiao Guo, Microsoft Research Asia, yuxgu@microsoft.com; Jiaolong Yang, Microsoft Research Asia, jiaoyan@microsoft.com; Chong Li, Microsoft Research Asia, chol@microsoft.com; Zhenyu Zang, Microsoft Research Asia, zhenyuzang@microsoft.com; Yizhong Zhang, Microsoft Research Asia, yizzhan@microsoft.com; Xin Tong, Microsoft Research Asia, xtong@microsoft.com; Baining Guo, Microsoft Research Asia, bainguo@microsoft.com |
| Pseudocode | No | The paper describes the methodology but does not contain any structured pseudocode or algorithm blocks labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | Based on the RAI considerations, we will not release our code or data in case of potential misuse, as discussed in Section A. |
| Open Datasets | Yes | For face latent space learning, we use the public VoxCeleb2 dataset from [14] which contains talking face videos from about 6K subjects. |
| Dataset Splits | No | The paper does not describe explicit train/validation/test splits; it only reports the overall training scale: "The total data used for training comprises approximately 500K clips, each lasting between 2 to 10 seconds. The parameter counts of our 3D-aided face latent model and diffusion transformer model are about 200M and 29M respectively." |
| Hardware Specification | Yes | Our face latent model takes around 7 days of training on a workstation with 4 NVIDIA RTX A6000 GPUs, and the diffusion transformer takes around 3 days. ... evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions using a pretrained feature extractor Wav2Vec2 [3] and SyncNet [15] for evaluation, but does not specify software dependencies with version numbers for its implementation. |
| Experiment Setup | Yes | For motion latent generation, we use an 8-layer transformer encoder with an embedding dim 512 and head number 8 as our diffusion network. The model is trained on VoxCeleb2 [14] and another high-resolution talk video dataset collected by us, which contains about 3.5K subjects. In our default setup, the model uses a forward-facing main gaze condition, an average head distance of all training videos, and an empty emotion offset condition. The CFG parameters are set to λA = 0.5 and λg = 1.0, and 50 sampling steps are used. |
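
The experiment-setup row above pins down a concrete configuration: an 8-layer transformer encoder with embedding dimension 512 and 8 attention heads as the diffusion network, two classifier-free-guidance scales (λA = 0.5, λg = 1.0), and 50 sampling steps. Since the authors do not release code, the sketch below is only a rough illustration of a network with those hyperparameters and of one common way two guidance scales could be combined; the class name, conditioning interface, output projection, and exact guidance formula are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the motion-latent diffusion network described in the
# table: an 8-layer transformer encoder, embedding dim 512, 8 attention heads.
# Only layer count, d_model, and nhead come from the paper; everything else
# (class name, output head, conditioning interface) is assumed for illustration.
class MotionLatentDenoiser(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, d_model)  # assumed output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence, d_model) tokens of noisy motion latents plus
        # already-embedded conditioning tokens (audio features, gaze, etc.).
        return self.out(self.encoder(x))


# Illustrative two-condition classifier-free guidance step using the scales
# quoted in the table (lambda_a = 0.5 for audio, lambda_g = 1.0 for gaze).
# This is a generic blend of two independently dropped conditions, not the
# paper's exact combination rule, which is not reproduced here.
def guided_prediction(eps_uncond, eps_audio, eps_gaze, lambda_a=0.5, lambda_g=1.0):
    return (
        eps_uncond
        + lambda_a * (eps_audio - eps_uncond)
        + lambda_g * (eps_gaze - eps_uncond)
    )
```

In an actual sampler, `guided_prediction` would be applied at each of the 50 denoising steps, with the unconditional and conditional noise estimates coming from separate forward passes of the denoiser under dropped and full conditioning, respectively.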