Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

Authors: Zhe Kong, Feng Gao, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Xunliang Cai, Guanying Chen, Wenhan Luo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct evaluations on various datasets, such as talking face, talking body, and multi-person conversation. The results demonstrate the effectiveness of the proposed method.
Researcher Affiliation | Collaboration | Zhe Kong (1,2,3), Feng Gao (2), Yong Zhang (2), Zhuoliang Kang (2), Xiaoming Wei (2), Xunliang Cai (2), Guanying Chen (1), Wenhan Luo (3). (1) Shenzhen Campus of Sun Yat-sen University; (2) Meituan; (3) Division of AMC and Department of ECE, HKUST
Pseudocode | No | The paper describes the methodology in detail across Sections 3.1 to 3.5, including architectural components and training strategies, but it does not present any explicitly labeled pseudocode or algorithm blocks. The pipeline is illustrated in Figure 2 and Figure 3.
Open Source Code | No | The NeurIPS Paper Checklist for question 5 states: Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: [No]
Open Datasets | Yes | For the talking head dataset, we employ two publicly available datasets, HDTF [43], and CelebV-HQ [44] for evaluation purposes. For the talking body dataset, we utilize the EMTD [10] dataset. ... All data used in our experiments were collected from publicly available sources on the internet. Our data collection process follows the best practices established by previous works [49, 50, 51], ensuring that our methods are consistent with the standards in the community. All data sources are under the CC BY 4.0 International license.
Dataset Splits | Yes | We collect a video dataset of about 2K hours for the first stage training, which covers the face or body of a single talking person. We also collect about 200K video clips that contain multiple events and human-object/environment interactions. The average clip duration is about 10 seconds. For the second stage training, we collect 100 hours of videos consisting of conversations between two persons. For evaluation, we employ three distinct types of testing datasets: the talking head dataset, the talking body dataset, and the dual-human talking body dataset with interactive scenarios.
Hardware Specification | Yes | The proposed method was trained using 64 NVIDIA H800-80G GPUs.
Software Dependencies | No | The paper mentions using 'Wan2.1-I2V-14B as the foundational video diffusion model' but does not specify versions for general software dependencies like programming languages, libraries, or operating systems.
Experiment Setup | Yes | The model is trained using a constant learning rate of 2e-5, incorporating a warm-up strategy, and optimized using the AdamW optimizer. During training, we only fine-tuned the audio cross-attention layer and adapter while keeping other layers frozen. ... In stage 1 of the training process, the batch size was set to 64, whereas in stage 2, the batch size was adjusted to 32.
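
The experiment setup quoted above can be sketched in a few lines of plain Python. This is only an illustrative reading of the quoted text, not the authors' code: the paper excerpt gives the learning rate (2e-5), the optimizer name, the batch sizes, and the fact that only the audio cross-attention layer and adapter are trained. The warm-up length (1000 steps), the linear warm-up shape, the function names, and the parameter-name keywords below are all assumptions made for illustration.

```python
# Hypothetical sketch of the reported training configuration.
# Only base_lr and the batch sizes come from the quoted text;
# warmup_steps, the linear ramp, and the keyword names are assumptions.

STAGE_BATCH_SIZE = {1: 64, 2: 32}  # stage 1 and stage 2 batch sizes as reported


def warmup_constant_lr(step, base_lr=2e-5, warmup_steps=1000):
    """Return the learning rate at a 0-indexed step.

    Ramps linearly up to base_lr over warmup_steps (assumed), then stays constant.
    """
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def trainable_names(param_names, keywords=("audio_cross_attn", "adapter")):
    """Select the parameters to fine-tune by substring match.

    The keywords are hypothetical stand-ins for the audio cross-attention
    layer and the adapter; every other parameter would stay frozen.
    """
    return [name for name in param_names if any(k in name for k in keywords)]


if __name__ == "__main__":
    names = [
        "blocks.0.self_attn.q.weight",
        "blocks.0.audio_cross_attn.k.weight",
        "audio_adapter.proj.weight",
    ]
    print(trainable_names(names))
    print(warmup_constant_lr(0), warmup_constant_lr(5000))
```

In a real training loop, the selected names would decide which parameters are handed to the optimizer, and the schedule function would scale the optimizer's learning rate at each step.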