Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking
Authors: Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang, Sheng Xu, Bang Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate its superiority over existing methods in terms of generation quality, particularly in preserving style consistency and ensuring precise audio-video synchronization, all while maintaining efficient inference. |
| Researcher Affiliation | Industry | Zhongjian Wang, Peng Zhang, Jinwei Qi, Guangyuan Wang, Sheng Xu, Bang Zhang; Tongyi Lab, Alibaba Group |
| Pseudocode | No | The paper includes architectural diagrams (Figure 1) and mathematical equations, but it does not present any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | We will also release the code after the review phase. |
| Open Datasets | Yes | We evaluated our method on both audio and video generation tasks. For audio, we employed the SEED [1] dataset, while for video generation, 500 clips were randomly selected from VoxCeleb2 [8] as the test set. To further validate the robustness of our approach in practical applications, 100 video clips from Chinese real-world scenarios were selected as Custom dataset for supplementary testing of audio and video quality. We pre-trained our model on a collection of large-scale open-source talking-head datasets and subsequently developed a high-quality dataset (totaling 690 hours) to fine-tune the model's performance. We direct the authors to Appendix C.1 for more details. In Appendix C.1: We employ TalkingHead-1KH [52], VoxCeleb [36], and CelebV-HQ [65] as pre-training datasets, while constructing a high-quality custom dataset for fine-tuning. |
| Dataset Splits | Yes | For audio, we employed the SEED [1] dataset, while for video generation, 500 clips were randomly selected from VoxCeleb2 [8] as the test set. To further validate the robustness of our approach in practical applications, 100 video clips from Chinese real-world scenarios were selected as Custom dataset for supplementary testing of audio and video quality. |
| Hardware Specification | Yes | The model was trained on 8 NVIDIA A100 GPUs using a batch size of 12,800 frames per GPU for 750,000 iterations with a learning rate of 1e-4... The proposed architecture achieves real-time inference on a single NVIDIA RTX 4090 GPU. |
| Software Dependencies | No | The paper mentions several tools and models used (e.g., Vocos [42], ConvNeXt-V2 [55], Whisper-Large-v3 [38], PySceneDetect [10], InsightFace [17], LightASD [29], FaceVerse [49]), but it does not provide specific version numbers for any key software components or libraries. |
| Experiment Setup | Yes | Our model includes 22 audio-visual fusion blocks with two parallel branches (4 single-modality DiT blocks each), 512-dim embeddings (audio/visual via linear layers, text via 4 ConvNeXt V2 blocks), totaling 0.8B parameters. The model was trained on 8 NVIDIA A100 GPUs using a batch size of 12,800 frames per GPU for 750,000 iterations with a learning rate of 1e-4. In Equation (3), we use λm = 0.1, λf = 3.0, λh = 0.5 and λe = 0.5. The audio features are represented as 100-D (F = 100) log-mel filterbank coefficients extracted with 24 kHz sampling rate and hop length 256, yielding an audio sequence with approximately 94 fps. In contrast, visual codes are captured at 30 fps and upsampled to 94 fps to match the audio frame rate. During inference, the duration of reference audio-visual content is restricted to 1-10 seconds, with any excess truncated. The CFG parameters are set to αMr = 2.0, αCr = 2.5 and αST = 2.0, while 16 sampling steps are used. We provide full details in Appendix C. |
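
The frame-rate figures quoted in the Experiment Setup row follow from simple arithmetic: a 24 kHz sampling rate with hop length 256 gives 24000 / 256 = 93.75 mel frames per second, which the paper rounds to 94 fps. The sketch below reproduces that calculation and shows how a 10-second cap on reference content could translate into a frame count. The function names and the truncation helper are hypothetical illustrations, not taken from the paper or its (unreleased) code.

```python
# Values quoted in the Experiment Setup row.
SAMPLE_RATE_HZ = 24_000
HOP_LENGTH = 256
VISUAL_FPS = 30
MAX_REFERENCE_SECONDS = 10


def audio_frame_rate(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     hop_length: int = HOP_LENGTH) -> float:
    """Mel frames per second: one frame is emitted every `hop_length` samples."""
    return sample_rate_hz / hop_length


def truncate_reference_frames(num_frames: int, fps: float,
                              max_seconds: float = MAX_REFERENCE_SECONDS) -> int:
    """Hypothetical helper: cap a reference clip at `max_seconds` worth of frames.

    The paper only states that excess reference content is truncated;
    how the cut is implemented is an assumption here.
    """
    max_frames = int(max_seconds * fps)
    return min(num_frames, max_frames)


if __name__ == "__main__":
    fps = audio_frame_rate()
    print(f"audio frame rate: {fps} fps")                  # 93.75, reported as ~94
    print(f"visual upsampling factor: {fps / VISUAL_FPS}")  # 3.125
    print(f"max reference frames: {truncate_reference_frames(2000, fps)}")
```

Under these assumptions, each 30 fps visual code has to be stretched by a factor of about 3.125 to align with the audio sequence; the interpolation method used for that upsampling is not specified in the quoted text.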