Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
VASA-3D: Lifelike Audio-Driven Gaussian Head Avatars from a Single Image
Authors: Sicheng Xu, Guojun Chen, Jiaolong Yang, Yizhong Zhang, Yu Deng, Stephen Lin, Baining Guo
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to analyze our method as well as the design choices. In the following experiments, we use ten portraits, including five males and five females, generated by StyleGAN2 [2] to train our models. We train the models using 4 NVIDIA A100 40G GPUs and a batch size of 4. A 512 x 512 resolution is used for both the training data and VASA-3D rendering throughout this paper. Table 1: Quantitative results of ablation studies. Table 2: Audio-driven generation comparison with VASA-1. Table 3: Comparison with audio-driven 3D talking head methods that are trained on videos. Our method is shown to generate high-quality 3D head renderings with accurate audio-lip sync, vivid facial expressions, and lively head motions. |
| Researcher Affiliation | Industry | Sicheng Xu, Guojun Chen, Jiaolong Yang, Yu Deng, Yizhong Zhang, Stephen Lin, Baining Guo. Microsoft Research Asia. |
| Pseudocode | No | The paper includes diagrams (Figure 1, Figure 2) and mathematical equations, but no explicitly labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | No | We can share our training data since they are synthesized videos, and part of the code for reproduction and comparison. However, due to responsible AI considerations, we are not able to release our full code to prevent potential misuse such as deepfake for fraud. |
| Open Datasets | Yes | For example, in most of our experiments, we randomly sample up to 10 hours of video clips from the VoxCeleb2 dataset [32] to render the training data. We collect 26 portraits, each with 1-minute high-quality talking videos, from the CelebV-HQ [43] dataset. |
| Dataset Splits | Yes | For each image, we use VASA-1 to render eight training datasets of different sizes, i.e., 5min, 10min, 20min, 30min, 1h, 2h, 5h, 10h, using VASA-1 latents extracted from random video clips in VoxCeleb2 [32]. We evaluate the models at varying training iteration numbers (up to 400K) on our test set, which are VASA-1 generated videos of 3min for each image. We only use the first 20 minutes to train the models, and the remaining 5 minutes are used as the test set. |
| Hardware Specification | Yes | We train the models using 4 NVIDIA A100 40G GPUs and a batch size of 4. Given an audio clip as input, the animation and 512 x 512 video frame rendering of our VASA-3D model can run at 75fps with a preceding latency of only 65ms, evaluated on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | Yes | We use Stable Diffusion v2.1 [36] as the diffusion model in our SDS loss with classifier-free guidance factor 10.0 and gradient scale 0.001. |
| Experiment Setup | Yes | We train the models using 4 NVIDIA A100 40G GPUs and a batch size of 4. A 512 x 512 resolution is used for both the training data and VASA-3D rendering throughout this paper. The CAS loss L_cas is applied after 200K iterations, and the model is fine-tuned for an additional 20K iterations with L_cas and other losses. In all our experiments, the loss weights are set as λ_ssim = 0.1, λ_lpips = 1.0, λ_adv = 0.001, λ_sds = 1.0, λ_consist = 0.01, and λ_cas = 10.0. Our models are trained for 200K iterations by default, excluding the CAS loss finetuning iterations. Gaussian densification and pruning start at the 10K iterations, with intervals of 2K iterations. We stop this process after 100K iterations or when the total number of Gaussians exceeds 200,000. |
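
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object. The following is a minimal sketch only, not the authors' code (which is not released): the names `TrainConfig`, `cas_active`, `should_densify`, and `weighted_loss` are hypothetical, the exact boundary handling (inclusive or exclusive iteration counts) is an assumption, and any loss term the quoted text does not assign a weight to is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; field names are hypothetical."""
    num_gpus: int = 4                      # NVIDIA A100 40G
    batch_size: int = 4
    resolution: int = 512                  # 512 x 512 training data and rendering
    base_iterations: int = 200_000         # default training length
    cas_finetune_iterations: int = 20_000  # extra fine-tuning with the CAS loss
    densify_start: int = 10_000
    densify_interval: int = 2_000
    densify_stop: int = 100_000
    max_gaussians: int = 200_000
    loss_weights: dict = field(default_factory=lambda: {
        "ssim": 0.1, "lpips": 1.0, "adv": 0.001,
        "sds": 1.0, "consist": 0.01, "cas": 10.0,
    })


def cas_active(cfg: TrainConfig, iteration: int) -> bool:
    # Assumption: the CAS loss switches on once the base schedule is complete.
    return iteration >= cfg.base_iterations


def should_densify(cfg: TrainConfig, iteration: int, num_gaussians: int) -> bool:
    # Densify/prune every `densify_interval` iterations inside the window,
    # and stop once the Gaussian count exceeds the cap.
    # Assumption: the window is [densify_start, densify_stop).
    if iteration < cfg.densify_start or iteration >= cfg.densify_stop:
        return False
    if num_gaussians > cfg.max_gaussians:
        return False
    return (iteration - cfg.densify_start) % cfg.densify_interval == 0


def weighted_loss(cfg: TrainConfig, terms: dict, iteration: int) -> float:
    # Weighted sum of the supplied loss terms; the CAS term is ignored
    # until the fine-tuning phase begins.
    total = 0.0
    for name, value in terms.items():
        if name == "cas" and not cas_active(cfg, iteration):
            continue
        total += cfg.loss_weights[name] * value
    return total
```

For example, `should_densify(TrainConfig(), 12_000, 50_000)` is true under these assumptions, while the same call at iteration 100,000 or with more than 200,000 Gaussians is false.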