Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

VividFace: A Robust and High-Fidelity Video Face Swapping Framework

Authors: Hao Shao, Shulun Wang, Yang Zhou, Guanglu Song, Dailan He, Zhuofan Zong, Shuo Qin, Yu Liu, Hongsheng Li

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments demonstrate that VividFace achieves state-of-the-art performance in identity preservation, temporal consistency, and visual realism, surpassing existing methods while requiring fewer inference steps. Our framework notably mitigates common challenges such as temporal flickering, identity loss, and sensitivity to occlusions and pose variations. |
| Researcher Affiliation | Collaboration | Hao Shao¹, Shulun Wang², Yang Zhou², Guanglu Song², Dailan He¹, Zhuofan Zong¹, Shuo Qin², Yu Liu², Hongsheng Li¹,³; ¹CUHK MMLab, ²SenseTime Research, ³CPII under InnoHK |
| Pseudocode | No | The paper describes methods and processes in paragraph form and through diagrams, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The AIDT dataset, source code, and pre-trained weights will be released to support future research. The code and pretrained weights are available on the project page. |
| Open Datasets | Yes | We collected approximately 550 hours of facial videos from the internet to train our models, and the facial images are partially sourced from VGGFace2-HQ [10]. ... For comparison, since other methods are based on image-level face swapping, we perform face swapping frame by frame for those methods. For facial data reconstruction, we use SSIM, PSNR and LPIPS [54] to evaluate the quality of reconstructed images and videos. ... Furthermore, since our model supports both image and video face swapping, we also evaluate it on the standard FFHQ dataset. |
| Dataset Splits | No | Considering that most previous baselines, such as CelebA [25] and FFHQ [21], are primarily focused on image face swapping, we propose a new benchmark for video face swapping, VidSwapBench. Our benchmark includes 200 source images and 200 high-resolution target videos, with each video containing 128 frames and a single trackable face. These videos and images feature unseen identities and backgrounds, ensuring a diverse and challenging dataset. To evaluate performance, we generate 200 swapped videos using our framework. For comparison, since other methods are based on image-level face swapping, we perform face swapping frame by frame for those methods. |
| Hardware Specification | Yes | The experiments are conducted using 16 NVIDIA A100 GPUs and optimized with AdamW [30]. |
| Software Dependencies | No | The paper mentions several frameworks and optimizers (AdamW, ArcFace, DINO, SCRFD, Stable Diffusion, AnimateDiff) but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We use a latent space of size 13 × 64 × 64 and a U-Net architecture for the ϵ_θ denoising network. Images and video clips sampled from the dataset are resized and cropped to 512 × 512. The number of motion frames, M, is set to 4, and the generated video length, T, is set to 8 frames. For the face encoder, the identity network is based on ArcFace [12], while the texture and attribute networks are based on DINO [7]. We use the SCRFD [17] for facial bounding box detection. The mixing coefficients of the face encoder are set to 1.0 for identity features, and 0.6 for both texture and attribute features. The experiments are conducted using 16 NVIDIA A100 GPUs and optimized with AdamW [30]. In the first stage of the VAE training, the learning rate is set to 5e-6 with a batch size of 32. The weights of reconstruction, perceptual, and KL divergence loss are 1.0, 0.1, 1e-6 respectively. For the second and third stages, the learning rate is increased to 1e-5, with the batch size remaining at 32. During inference, we generate video clips using the DDIM sampling algorithm for 32 steps. |
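
For readers attempting a reproduction, the values quoted in the Experiment Setup entry can be collected into a single configuration object. The sketch below is illustrative only and is not taken from the paper or its released code. The class name, field names, and helper function are hypothetical. It also assumes that the three first-stage VAE loss terms are combined as a plain weighted sum, which the quoted text suggests but does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup entry; all field names are hypothetical."""
    image_size: int = 512          # images and clips are resized and cropped to 512 x 512
    motion_frames: int = 4         # M, number of motion frames
    video_length: int = 8          # T, generated video length in frames
    identity_mix: float = 1.0      # face-encoder mixing coefficient, identity features
    texture_mix: float = 0.6       # face-encoder mixing coefficient, texture features
    attribute_mix: float = 0.6     # face-encoder mixing coefficient, attribute features
    vae_lr: float = 5e-6           # learning rate, first stage (VAE training)
    later_stage_lr: float = 1e-5   # learning rate, second and third stages
    batch_size: int = 32           # same batch size in all stages
    ddim_steps: int = 32           # DDIM sampling steps at inference
    w_recon: float = 1.0           # weight of the reconstruction loss
    w_perceptual: float = 0.1      # weight of the perceptual loss
    w_kl: float = 1e-6             # weight of the KL divergence loss


def vae_total_loss(recon: float, perceptual: float, kl: float,
                   cfg: ReportedSetup = ReportedSetup()) -> float:
    """Combine the three loss terms as a weighted sum (an assumed form)."""
    return cfg.w_recon * recon + cfg.w_perceptual * perceptual + cfg.w_kl * kl


if __name__ == "__main__":
    # Made-up loss values, used only to show how the weights scale each term.
    print(vae_total_loss(recon=2.0, perceptual=3.0, kl=1000.0))
```

With these weights the KL term contributes very little unless its raw value is several orders of magnitude larger than the reconstruction term. Details the entry does not report, such as optimizer betas, weight decay, and learning-rate schedule, are deliberately left out rather than guessed.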