Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Aligning What Matters: Masked Latent Adaptation for Text-to-Audio-Video Generation

Authors: Jiyang Zheng, Siqi Pan, Yu Yao, Zhaoqing Wang, Dadong Wang, Tongliang Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, our method achieves state-of-the-art performance on standard T2AV benchmarks, demonstrating significant improvements in audiovisual synchronization and semantic consistency. We conduct experiments on two benchmark datasets: VGGSound [5] and AudioCaps [22].
Researcher Affiliation | Collaboration | 1 Sydney AI Center, The University of Sydney; 2 CSIRO, Data61; 3 Dolby Laboratories. EMAIL {siqi.pan}@dolby.com {dadong.wang}@data61.csiro.au EMAIL
Pseudocode | No | The paper describes the methodology in prose and figures (Figure 3) but does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Answer: [No] Justification: In accordance to the company policy.
Open Datasets | Yes | We conduct experiments on two benchmark datasets: VGGSound [5] and AudioCaps [22].
Dataset Splits | Yes | Following the protocol in [52], we sample 5K and 3K clips from the train and test split, respectively, and annotate them with text prompts using VideoBlip [54], as adopted in [33]. AudioCaps consists of 46K audio clips paired with human-written captions sourced from AudioSet, and serves as a standard benchmark for audio-language grounding. We also sample 5K paired clips from the training split. To facilitate alignment learning and fine-tuning, we merge the training sets of both datasets, and perform evaluation separately on each test set.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for the experiments. While the NeurIPS checklist indicates that sufficient information on computer resources was provided, specific model numbers or types are absent from the paper's content.
Software Dependencies | No | The paper mentions models like CogVideoX1.5, AudioLDM, CLAP encoder, and LoRA, which are frameworks or components. However, it does not specify software dependencies with version numbers (e.g., Python version, specific library versions like PyTorch, CUDA, etc.).
Experiment Setup | Yes | The latent dimensionality of aligned embeddings for audio generation is fixed at 512. Each generated sample has a duration of 10 seconds, with video rendered at 16 frames per second and audio sampled at 48 kHz. Our adapter and masking modules are implemented as multilayer perceptrons. For the masking mechanism, we evaluate both soft masks (sigmoid outputs as weights) and hard masks, obtained by thresholding at 0.5. The loss weights λ1 and λ2 are empirically set to 5 and 0.1, respectively.
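
The Experiment Setup entry quotes two masking variants and a fixed clip format. Since the paper's code is not released, the following is only a minimal Python sketch of what those quoted settings could mean in practice, not the authors' implementation. The function names (sigmoid, soft_mask, hard_mask, apply_mask) are hypothetical, the masks are applied element-wise as an assumption, and the choice of a strict ">" at the 0.5 threshold is also an assumption, since the quoted text does not say how ties are handled.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def soft_mask(logits):
    """Soft mask: sigmoid outputs used directly as weights in (0, 1)."""
    return [sigmoid(z) for z in logits]

def hard_mask(logits, threshold=0.5):
    """Hard mask: sigmoid outputs thresholded at 0.5 (strict '>' is an assumption)."""
    return [1.0 if sigmoid(z) > threshold else 0.0 for z in logits]

def apply_mask(latent, mask):
    """Element-wise weighting of a latent vector (assumed application of the mask)."""
    return [v * m for v, m in zip(latent, mask)]

# Clip format quoted in the setup: 10 s, 16 fps video, 48 kHz audio.
DURATION_S = 10
VIDEO_FPS = 16
AUDIO_SR_HZ = 48_000
num_frames = DURATION_S * VIDEO_FPS            # video frames per sample
num_audio_samples = DURATION_S * AUDIO_SR_HZ   # audio samples per sample

if __name__ == "__main__":
    logits = [-2.0, 0.5, 3.0]      # hypothetical mask-module outputs
    latent = [1.0, 2.0, 4.0]       # hypothetical latent values
    print(apply_mask(latent, soft_mask(logits)))
    print(apply_mask(latent, hard_mask(logits)))
    print(num_frames, num_audio_samples)
```

Under these assumptions, each 10-second sample corresponds to 160 video frames and 480,000 audio samples. The soft variant scales every latent entry by a weight between 0 and 1, whereas the hard variant keeps or zeroes each entry outright.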