Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Audio-Sync Video Generation with Multi-Stream Temporal Control
Authors: Shuchen Weng, Haojie Zheng, zheng chang, Si Li, Boxin Shi, Xinlong Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment. The paper includes a dedicated '5 Experiments' section discussing quantitative comparisons, ablation studies, and user studies. |
| Researcher Affiliation | Academia | The authors are affiliated with: (1) Beijing Academy of Artificial Intelligence; (2) School of Software and Microelectronics, Peking University; (3) School of Artificial Intelligence, Beijing University of Posts and Telecommunications; (4) State Key Lab of Multimedia Info. Processing, School of Computer Science, Peking University; (5) Nat'l Eng. Research Ctr. of Visual Tech., School of Computer Science, Peking University. Beijing Academy of Artificial Intelligence is a non-profit research institution, and Peking University and Beijing University of Posts and Telecommunications are academic institutions. All affiliations are academic or public research institutions. |
| Pseudocode | No | The paper describes the methodology using textual explanations and diagrams (e.g., Fig. 2), but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | In the NeurIPS Paper Checklist, for Question 5 'Does the paper provide open access to the data and code...', the answer is '[No]' with the justification: 'Both the dataset and the codes will be released upon acceptance.' |
| Open Datasets | No | The paper introduces a new dataset called DEMIX, which is the primary dataset used for training and evaluation. In the NeurIPS Paper Checklist, for Question 5 'Does the paper provide open access to the data and code...', the answer is '[No]' with the justification: 'Both the dataset and the codes will be released upon acceptance.' This indicates that the main dataset is not yet publicly available. |
| Dataset Splits | Yes | For comprehensive evaluation, we hold out 1K video clips from the dataset to form the testing set. The paper also describes subset divisions for multi-stage training: 'The basic face subset comprises all talking head videos. The remaining cinematic and film-related videos are then categorized to form the other subsets: assignment to single character or multiple characters depends on the annotated human count, while assignment to sound event or visual mood occurs if the respective effects or music track is non-silent.' And 'After data collection and filtering, our DEMIX dataset includes 18K basic face, 54K single character, 39K multiple characters, 166K sound event, and 195K visual mood data...' |
| Hardware Specification | Yes | In 'Training details': 'For each stage, we train our model for 40K steps on 24 NVIDIA A800 GPUs... For inference, our model requires 280s to generate a 49-frame audio-sync video on a NVIDIA A100 GPU.' |
| Software Dependencies | No | The paper mentions several software tools and models like 'PySceneDetect', 'Audiobox-aesthetics', 'MVSEP', 'Spleeter', 'YOLO', 'Scribe', 'TalkNet', 'wav2vec', 'CLIP', and 'BEATs', but it does not specify any version numbers for these software dependencies. |
| Experiment Setup | Yes | In 'Training details': 'For each stage, we train our model for 40K steps on 24 NVIDIA A800 GPUs using the Adam-based optimizer [59] with a learning rate 1 × 10⁻⁵, where MST-ControlNet and attention layers of the backbone are trainable.' |
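
The subset rules quoted in the Dataset Splits row can be read as a small decision procedure. The sketch below is only an illustration of that reading, not the authors' pipeline: the `Clip` fields, the subset labels, and the function name `assign_subsets` are hypothetical, and it assumes that "single" means an annotated human count of exactly one, that effects map to the sound event subset and music to the visual mood subset, and that a clip may belong to more than one non-face subset.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """Hypothetical per-clip annotations; field names are illustrative only."""
    is_talking_head: bool   # True for talking head videos
    human_count: int        # annotated number of people in the clip
    effects_silent: bool    # True if the demixed effects track is silent
    music_silent: bool      # True if the demixed music track is silent


def assign_subsets(clip: Clip) -> set[str]:
    """Return the subsets a clip would fall into under the quoted rules."""
    # All talking head videos form the basic face subset.
    if clip.is_talking_head:
        return {"basic_face"}

    subsets: set[str] = set()
    # Character subsets depend on the annotated human count
    # (assumption: exactly one person vs. more than one).
    if clip.human_count == 1:
        subsets.add("single_character")
    elif clip.human_count > 1:
        subsets.add("multiple_characters")
    # Audio-driven subsets apply when the respective track is non-silent.
    if not clip.effects_silent:
        subsets.add("sound_event")
    if not clip.music_silent:
        subsets.add("visual_mood")
    return subsets


if __name__ == "__main__":
    example = Clip(is_talking_head=False, human_count=2,
                   effects_silent=False, music_silent=True)
    print(sorted(assign_subsets(example)))
```

Because this reading lets subsets overlap, the per-subset sizes reported in the paper (18K, 54K, 39K, 166K, 195K) would not necessarily sum to the number of distinct clips.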