Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

ThinkSound: Chain-of-Thought Reasoning in Multimodal LLMs for Audio Generation and Editing

Authors: Huadai Liu, Kaicheng Luo, Jialei Wang, Wen Wang, Qian Chen, Zhou Zhao, Wei Xue

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics, and excels in the out-of-distribution Movie Gen Audio benchmark. The project page is available at https://ThinkSound-Project.github.io.
Researcher Affiliation Collaboration (1) Hong Kong University of Science and Technology (HKUST); (2) Tongyi Fun Team, Alibaba Group; (3) Zhejiang University
Pseudocode No The paper describes the method using textual descriptions and architectural diagrams (Figure 2, Figure 3) but does not include any explicit pseudocode or algorithm blocks.
Open Source Code Yes "we will open-source our codes (contains model or their access method), which is enough to reproduce our experiments." and "We provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results."
Open Datasets Yes The AudioCoT dataset comprises both video-audio and audio-text pairs. For video-audio data, we utilize VGGSound (Chen et al., 2020) and a curated, non-speech subset of AudioSet (Gemmeke et al., 2017) to ensure broad coverage of real-world audiovisual events. For audio-text data, we aggregate pairs from AudioSet-SL (Hershey et al., 2021), Freesound (Fonseca et al., 2017), AudioCaps (Kim et al., 2019), and BBC Sound Effects 2, resulting in a diverse and representative corpus for training multimodal models.
Dataset Splits Yes For the video-to-audio generation task, we use the VGGSound test set as the in-distribution evaluation set while the Movie Gen Audio Bench is the out-of-distribution evaluation set. For the VGGSound test set, we use the same quality filtering protocol as our training data preparation. Given that our primary focus is on video-to-sound/music generation, we construct three different difficulty levels based on the complexity of the audio-visual relationships. ... For each difficulty level, we construct a benchmark subset containing around 2000 samples. ... Each evaluation subset contains approximately 2,000 samples.
Hardware Specification Yes We employ mixed precision training with a batch size of 144 across 24 A800 GPUs for 500,000 steps. ... In the foundation model training phase, we utilize an exponential moving average and automatic mixed precision for 100,000 steps on 8 A100 GPUs, with an effective batch size of 256. ... During the task-specific fine-tuning stage, we similarly apply exponential moving average and automatic mixed precision for 50,000 steps on 8 A100 GPUs, maintaining an effective batch size of 256.
Software Dependencies No The paper mentions several models and frameworks used (e.g., VideoLLaMA2, GPT-4.1-nano, Qwen2-Audio, MetaCLIP, T5-v1-xl, Grounded SAM2) but does not provide specific version numbers for general software dependencies like programming languages or deep learning libraries.
Experiment Setup Yes We employ mixed precision training with a batch size of 144 across 24 A800 GPUs for 500,000 steps. Subsequently, following Evans et al. (2024), we freeze the VAE encoder and train the VAE decoder with a latent mask ratio of 0.1 for an additional 500,000 steps. We use AdamW (Loshchilov & Hutter, 2019) as the optimizer, setting the generator learning rate to 3e-5 and the discriminator learning rate to 6e-5. In the foundation model training phase, we utilize an exponential moving average and automatic mixed precision for 100,000 steps on 8 A100 GPUs, with an effective batch size of 256. We adopt a cfg dropout of 0.2 for each modality with a learning rate of 1e-4. During the task-specific fine-tuning stage, we similarly apply exponential moving average and automatic mixed precision for 50,000 steps on 8 A100 GPUs, maintaining an effective batch size of 256. AdamW remains our optimizer of choice, with a learning rate set at 1e-4.
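
The foundation-model settings quoted in the Experiment Setup row (effective batch size of 256 on 8 GPUs, cfg dropout of 0.2 per modality, exponential moving average) can be illustrated with a small pure-Python sketch. This is not the authors' code. The EMA decay, the even split of the batch across GPUs, the use of None as the dropped condition, and all function and modality names are assumptions made here for illustration only; the paper excerpt does not specify them.

```python
import random

# Values quoted in the Experiment Setup row.
CFG_DROPOUT = 0.2            # conditioning dropout applied to each modality
EFFECTIVE_BATCH_SIZE = 256
NUM_GPUS = 8

# Assumption: the excerpt reports no EMA decay; 0.999 is only a placeholder.
EMA_DECAY = 0.999


def per_gpu_batch_size(effective: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Samples per GPU per forward pass, assuming the batch is split evenly."""
    per_step = num_gpus * grad_accum_steps
    if effective % per_step != 0:
        raise ValueError("effective batch size must divide evenly across GPUs and steps")
    return effective // per_step


def drop_conditions(conditions: dict, p: float, rng: random.Random) -> dict:
    """Independently drop each modality's conditioning with probability p.

    A dropped condition is represented by None (hypothetical; a real model
    would typically substitute a null or learned "empty" embedding).
    """
    return {name: (None if rng.random() < p else value)
            for name, value in conditions.items()}


def ema_update(ema_params: list, params: list, decay: float) -> list:
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(per_gpu_batch_size(EFFECTIVE_BATCH_SIZE, NUM_GPUS))
    # "video" and "text" are hypothetical modality names.
    print(drop_conditions({"video": "v_feat", "text": "t_feat"}, CFG_DROPOUT, rng))
    print(ema_update([1.0, 2.0], [0.0, 0.0], EMA_DECAY))
```

Under the even-split assumption and no gradient accumulation, 256 samples over 8 GPUs gives 32 samples per GPU; whether the authors reached the effective batch size this way or through gradient accumulation is not stated in the excerpt.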