Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking

Authors: Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang, Sheng Xu, Bang Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate its superiority over existing methods in terms of generation quality, particularly in preserving style consistency and ensuring precise audio-video synchronization, all while maintaining efficient inference.
Researcher Affiliation | Industry | Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang, Sheng Xu, Bang Zhang; Tongyi Lab, Alibaba Group
Pseudocode | No | The paper includes architectural diagrams (Figure 1) and mathematical equations, but it does not present any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | No | We will also release the code after the review phase.
Open Datasets | Yes | We evaluated our method on both audio and video generation tasks. For audio, we employed the SEED [1] dataset, while for video generation, 500 clips were randomly selected from VoxCeleb2 [8] as the test set. To further validate the robustness of our approach in practical applications, 100 video clips from Chinese real-world scenarios were selected as Custom dataset for supplementary testing of audio and video quality. We pre-trained our model on a collection of large-scale open-source talking-head datasets and subsequently developed a high-quality dataset (totaling 690 hours) to fine-tune the model's performance. We direct the authors to Appendix C.1 for more details. In Appendix C.1: We employ TalkingHead-1KH [52], VoxCeleb [36], and CelebV-HQ [65] as pre-training datasets, while constructing a high-quality custom dataset for fine-tuning.
Dataset Splits | Yes | For audio, we employed the SEED [1] dataset, while for video generation, 500 clips were randomly selected from VoxCeleb2 [8] as the test set. To further validate the robustness of our approach in practical applications, 100 video clips from Chinese real-world scenarios were selected as Custom dataset for supplementary testing of audio and video quality.
Hardware Specification | Yes | The model was trained on 8 NVIDIA A100 GPUs using a batch size of 12,800 frames per GPU for 750,000 iterations with a learning rate of 1e-4... The proposed architecture achieves real-time inference on a single NVIDIA RTX 4090 GPU.
Software Dependencies | No | The paper mentions several tools and models used (e.g., Vocos [42], ConvNeXt-V2 [55], Whisper-Large-v3 [38], PySceneDetect [10], Insightface [17], LightASD [29], FaceVerse [49]), but it does not provide specific version numbers for any key software components or libraries.
Experiment Setup | Yes | Our model includes 22 audio-visual fusion blocks with two parallel branches (4 single-modality DiT blocks each), 512-dim embeddings (audio/visual via linear layers, text via 4 ConvNeXt V2 blocks), totaling 0.8B parameters. The model was trained on 8 NVIDIA A100 GPUs using a batch size of 12,800 frames per GPU for 750,000 iterations with a learning rate of 1e-4. In Equation (3), we use λm = 0.1, λf = 3.0, λh = 0.5 and λe = 0.5. The audio features are represented as 100-D (F = 100) log-mel filterbank coefficients extracted with 24 kHz sampling rate and hop length 256, yielding an audio sequence with approximately 94 fps. In contrast, visual codes are captured at 30 fps and upsampled to 94 fps to match the audio frame rate. During inference, the duration of reference audio-visual content is restricted to 1-10 seconds, with any excess truncated. The CFG parameters are set to αMr = 2.0, αCr = 2.5 and αST = 2.0, while 16 sampling steps are used. We provide full details in Appendix C.
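
The frame-rate figures quoted in the Experiment Setup row follow from simple arithmetic: 24,000 samples per second divided by a hop length of 256 gives 93.75 feature frames per second, which the paper rounds to 94. The Python sketch below reproduces that arithmetic and illustrates one possible way to stretch 30 fps visual codes to the audio frame count and to cap reference material at 10 seconds. The function names, the floor-based frame counting, and the use of linear interpolation are assumptions made for illustration; the quoted text does not say which upsampling method or framing convention the authors use.

```python
SAMPLE_RATE_HZ = 24_000  # sampling rate quoted in the setup row
HOP_LENGTH = 256         # hop length quoted in the setup row
VIDEO_FPS = 30           # visual code rate quoted in the setup row
MAX_REFERENCE_S = 10.0   # upper bound on reference duration quoted in the setup row


def audio_frame_rate(sample_rate_hz=SAMPLE_RATE_HZ, hop_length=HOP_LENGTH):
    """Feature frames per second: one frame every `hop_length` samples."""
    return sample_rate_hz / hop_length


def num_audio_frames(duration_s):
    """Frame count for a clip (assumption: floor, no padding at the edges)."""
    return int(duration_s * audio_frame_rate())


def upsample_linear(codes, target_len):
    """Stretch per-frame vectors to `target_len` frames.

    Assumption: linear interpolation between neighbouring frames; the
    source only says the visual codes are upsampled to the audio rate.
    """
    n = len(codes)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one input frame and one output frame")
    if n == 1 or target_len == 1:
        return [list(codes[0]) for _ in range(target_len)]
    scale = (n - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = min(int(pos), n - 2)  # left neighbour, clamped at the end
        frac = pos - lo
        out.append([a + (b - a) * frac for a, b in zip(codes[lo], codes[lo + 1])])
    return out


def truncate_reference(frames, max_seconds=MAX_REFERENCE_S):
    """Drop any reference frames beyond `max_seconds` at the audio rate."""
    return frames[: num_audio_frames(max_seconds)]


if __name__ == "__main__":
    clip_s = 2.0
    visual = [[float(t), 0.0] for t in range(int(clip_s * VIDEO_FPS))]
    aligned = upsample_linear(visual, num_audio_frames(clip_s))
    print(audio_frame_rate(), len(visual), len(aligned))
```

For a 2-second clip this maps 60 visual frames onto 187 audio-rate frames, so each visual code is spread over roughly three audio frames. A nearest-neighbour repeat would be an equally plausible reading of "upsampled" and would only change the body of `upsample_linear`.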