Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ZeroSep: Separate Anything in Audio with Zero Training

Authors: Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments 4.1 Experimental Settings Baselines. To evaluate our training-free diffusion-based separation method, we compare it against two categories of existing approaches: (i) Training-based methods. Datasets. We evaluate the open-set separation capabilities of our training-free method on two benchmark multimodal datasets with paired audio and text labels: The Audio Visual Event (AVE) [Tian et al., 2018] dataset contains 4,143 video clips, each 10 seconds long, covering 28 distinct sound categories (e.g., church bell, barking, frying). The MUSIC dataset [Zhao et al., 2018] consists of clean solo performances from 11 musical instruments, thereby offering a controlled environment to assess the separation of individual, isolated sources with minimal interference. Evaluation Metrics. Traditional separation metrics Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR) [Raffel et al., 2014] quantify sample-level differences between a separated output ŝ and the ground truth s. 4.2 Main Comparison Tab. 1 presents the core results of our evaluation, comparing the performance of our training-free method, ZeroSep, against representative training-based and other training-free baselines on the AVE and MUSIC datasets. 4.3 Ablation Studies In this section, we analyze the influence of various components on ZeroSep's separation performance to identify factors contributing to its effectiveness.
Researcher Affiliation | Collaboration | Chao Huang¹, Yuesheng Ma², Junxuan Huang³, Susan Liang¹, Yunlong Tang¹, Jing Bi¹, Wenqiang Liu³, Nima Mesgarani², Chenliang Xu¹ ¹University of Rochester, ²Columbia University, ³Tencent America
Pseudocode | No | The paper describes methods using mathematical equations and textual descriptions, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | No | Our project page is here: https://wikichao.github.io/ZeroSep/. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The code will be released upon acceptance.
Open Datasets | Yes | We evaluate the open-set separation capabilities of our training-free method on two benchmark multimodal datasets with paired audio and text labels: The Audio Visual Event (AVE) [Tian et al., 2018] dataset contains 4,143 video clips, each 10 seconds long, covering 28 distinct sound categories (e.g., church bell, barking, frying). The MUSIC dataset [Zhao et al., 2018] consists of clean solo performances from 11 musical instruments, thereby offering a controlled environment to assess the separation of individual, isolated sources with minimal interference. AudioSet [Gemmeke et al., 2017]
Dataset Splits | Yes | To facilitate comparison with prior research and ensure reproducibility, we use the official separation data splits for both AVE and MUSIC as provided by the DAVIS repository [Huang et al., 2024a].
Hardware Specification | Yes | Table 7: Runtime complexity and average cost (separating one source) on A100 GPU.
Software Dependencies | No | The paper does not specify version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | The Crucial Role of Guidance Weight ω: A key discovery is that achieving separation hinges on setting the classifier-free guidance weight ω appropriately, specifically ω = 1... We empirically find that ω = 1 yields the best separation results (as shown in Fig. 3(a)). This finding reveals that controlling the balance between conditional and unconditional predictions via ω is critical for steering the diffusion process from generation towards faithful separation. We compare different sizes within the AudioLDM and AudioLDM2 families (e.g., AudioLDM-S vs. AudioLDM-L, AudioLDM2-S vs. AudioLDM2-L). As shown in Tab. 3, increasing model size consistently leads to improved separation performance. Results on MUSIC are shown in Tab. 6. (i) Low-to-high schedules (0 → 1, sine) degrade separation, as early underconditioning causes loss of clean target structure. (ii) High-to-low schedules (1 → 0) improve over constant ω = 1, consistent with reports that guidance is most useful in the early-to-mid noise range but less helpful at the end. This supports our main claim that ω = 1 is an effective default while also motivating dynamic scheduling as a promising future direction.
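
The guidance-weight discussion in the Experiment Setup row can be made concrete with a small sketch. The paper has not released code, so this is not the authors' implementation: it assumes the common classifier-free guidance form eps = eps_uncond + ω (eps_cond − eps_uncond), under which ω = 0 gives the unconditional prediction and ω = 1 the purely conditional one. The function names and the linear schedule shapes are hypothetical, chosen only to illustrate constant, low-to-high, and high-to-low settings of ω.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float], omega: float) -> List[float]:
    """Blend unconditional and conditional noise predictions.

    Assumed form: eps = eps_uncond + omega * (eps_cond - eps_uncond).
    omega = 0 returns eps_uncond; omega = 1 returns eps_cond.
    """
    return [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def constant_schedule(step: int, num_steps: int) -> float:
    """Constant omega = 1 at every sampling step."""
    return 1.0


def high_to_low_schedule(step: int, num_steps: int) -> float:
    """Hypothetical linear ramp: omega = 1 at the first step, 0 at the last."""
    if num_steps <= 1:
        return 1.0
    return 1.0 - step / (num_steps - 1)


def low_to_high_schedule(step: int, num_steps: int) -> float:
    """Hypothetical linear ramp: omega = 0 at the first step, 1 at the last."""
    if num_steps <= 1:
        return 1.0
    return step / (num_steps - 1)


if __name__ == "__main__":
    # Toy two-element "noise predictions" standing in for model outputs.
    eps_u = [0.0, 2.0]
    eps_c = [1.0, 4.0]
    num_steps = 5
    for step in range(num_steps):
        omega = high_to_low_schedule(step, num_steps)
        print(step, omega, cfg_combine(eps_u, eps_c, omega))
```

In a real sampler the two predictions would come from the diffusion model with and without the text condition, and the chosen schedule would be queried once per denoising step. Step 0 is taken here to be the first (noisiest) step, which is also an assumption.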