Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Authors: Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di ZHANG, Hongyan Liu, Jun He

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that OmniSync significantly outperforms prior methods in both visual quality and lip sync accuracy, achieving superior results in both real-world and AI-generated videos. Our contributions can be summarized as follows: ... A comprehensive AIGC-LipSync Benchmark for evaluating lip synchronization in AI-generated content, including stylized characters and non-human entities. ... 4 Experiments; 4.1 Experimental Settings; 4.2 Quantitative Evaluation; 4.3 Qualitative Evaluation; 4.4 Ablation Study
Researcher Affiliation | Collaboration | (1) Renmin University of China; (2) Kling Team, Kuaishou Technology; (3) Tsinghua University
Pseudocode | No | The paper describes the methodology using mathematical equations and descriptive text, but it does not contain any explicit pseudocode blocks or algorithm listings.
Open Source Code | No | Due to ongoing anonymization and preparation, the benchmark will be made publicly available upon publication. (This refers to the benchmark, not the code for the methodology.)
Open Datasets | Yes | Datasets. We trained OmniSync using the MEAD dataset [42] and a 400-hour dataset collected from YouTube. MEAD's controlled laboratory recordings with diverse facial expressions but minimal head movement provided ideal data for training early denoising stages, while the YouTube dataset enhanced generalization across varied real-world conditions for middle and late stages.
Dataset Splits | No | The paper describes a "Timestep-Dependent Sampling Strategy" for training, where "$p(V_{cd}, V_{ab} \mid t) = p_{\text{pseudo-paired}}(V_{cd}, V_{ab})$ if $t > t_{\text{threshold}}$, $p_{\text{arbitrary}}(V_{cd}, V_{ab})$ otherwise" and mentions using "pseudo-paired data from controlled laboratory settings" for early timesteps and "arbitrary videos" for middle and late timesteps. However, it does not provide explicit train/test/validation splits (e.g., percentages or counts) for the overall datasets used.
Hardware Specification | Yes | Training is completed in 80 hours using 64 NVIDIA A100 GPUs with a batch size of 64.
Software Dependencies | No | The paper mentions "Audio features are extracted via a pre-trained Whisper encoder, and text conditioning utilizes a T5 encoder" and "AdamW optimizer [21]", but does not specify software versions for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We trained OmniSync using the MEAD dataset [42] and a 400-hour dataset collected from YouTube. ... The model is trained on a combined dataset for 80,000 steps using AdamW optimizer [21] with a learning rate of 1e-5. Training is completed in 80 hours using 64 NVIDIA A100 GPUs with a batch size of 64. Audio features are extracted via a pre-trained Whisper encoder, and text conditioning utilizes a T5 encoder. Training employs the timestep-dependent sampling threshold $t_{\text{threshold}} = 850$. During inference we adopt our flow-matching-based progressive noise initialization starting at τ = 0.92, followed by 50 denoising steps. ... γ controls the decay rate, with a value of 1.5.
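
For readers who want to record the quoted settings in code, the following is a minimal Python sketch, not the authors' implementation (no code has been released for the method). It collects the hyperparameters quoted in the Experiment Setup row and illustrates the timestep-dependent sampling rule quoted in the Dataset Splits row. All class, field, and function names are hypothetical. The derived sample count assumes that the reported batch size of 64 is the global batch size, which the quoted text does not state.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted from the paper; field names are illustrative only."""
    train_steps: int = 80_000
    learning_rate: float = 1e-5   # AdamW
    batch_size: int = 64
    num_gpus: int = 64            # NVIDIA A100
    train_hours: float = 80.0
    t_threshold: int = 850        # timestep-dependent sampling threshold
    tau_init: float = 0.92        # progressive noise initialization start
    denoise_steps: int = 50
    gamma: float = 1.5            # decay rate

    @property
    def gpu_hours(self) -> float:
        # Wall-clock hours multiplied by the number of GPUs.
        return self.num_gpus * self.train_hours

    @property
    def samples_seen(self) -> int:
        # ASSUMPTION: batch_size is the global batch size per step.
        return self.train_steps * self.batch_size


def pick_pair_source(t: int, t_threshold: int = 850) -> str:
    """Quoted rule: pseudo-paired data if t > t_threshold, else arbitrary."""
    return "pseudo-paired" if t > t_threshold else "arbitrary"


def sample_pair(t, pseudo_paired, arbitrary, rng, t_threshold=850):
    """Draw one training pair from the pool selected by the timestep t."""
    if pick_pair_source(t, t_threshold) == "pseudo-paired":
        pool = pseudo_paired
    else:
        pool = arbitrary
    return rng.choice(pool)


if __name__ == "__main__":
    setup = ReportedSetup()
    print(setup.gpu_hours, setup.samples_seen)
    print(pick_pair_source(900), pick_pair_source(400))
```

Under these assumptions the reported run corresponds to 64 × 80 = 5,120 A100 GPU-hours. The sampling rule sends only timesteps above 850 to the pseudo-paired (MEAD-style) pool and all others to the arbitrary-video pool.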