Prompt-guided Precise Audio Editing with Diffusion Models

Authors: Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu

ICML 2024

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | 4. Experiments: Experimental results highlight the effectiveness of our method in various editing tasks. We construct our test set utilizing a cleaned subset of the FSD50K dataset (Fonseca et al., 2021; Li et al., 2023). For objective metrics, we leverage commonly used metrics to evaluate the editing effects. We leverage Fréchet distance (FD), Fréchet audio distance (FAD), spectral distance (SD), and Kullback-Leibler (KL) divergence to measure the distance between the edited audio and the ground truth.
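The metrics row lists Fréchet-style distances between edited audio and ground truth. As a generic sketch (not the paper's evaluation code), the Fréchet distance between two sets of embeddings fit as Gaussians is ||μ₁−μ₂||² + Tr(Σ₁+Σ₂−2(Σ₁Σ₂)^½); the choice of embedding network is left unspecified here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_*: (n_samples, dim) arrays of embeddings, e.g. from an audio
    classifier. Illustrative only; not the paper's exact metric code.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Matrix square root of the covariance product; discard tiny
    # imaginary parts introduced by numerical error.
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

With identical inputs the distance is (numerically) zero, and it grows with a mean shift between the two sets.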
Researcher Affiliation | Collaboration | Work done while Manjie Xu and Duzhen Zhang were interns at Tencent. ¹Beijing Institute of Technology, ²Tencent AI Lab Beijing, ³Tencent AI Lab Seattle. Correspondence to: Chenxing Li <lichenxing007@gmail.com>, Wei Liang <liangwei@bit.edu.cn>, Dong Yu <dongyu@ieee.org>.
Pseudocode | Yes | Algorithm 1: PPAE
Open Source Code | Yes | See the project page at https://sites.google.com/view/icml24-ppae.
Open Datasets | Yes | We construct our test set utilizing a cleaned subset of the FSD50K dataset (Fonseca et al., 2021; Li et al., 2023).
Dataset Splits | No | We construct our test set utilizing a cleaned subset of the FSD50K dataset (Fonseca et al., 2021; Li et al., 2023). A pivotal aspect of precise audio editing is implementing precise modifications while maintaining the other elements of the audio unchanged. In each task, we select two distinct audio clips, treating one as the target for editing. For each task, we randomly sample 100 editing tasks as the test set.
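The split description (pairs of distinct clips, with 100 randomly sampled tasks per editing task type) could be sketched as below. The function and field names are illustrative, not from the paper's code.

```python
import random

def sample_editing_tasks(clip_ids, n_tasks=100, seed=0):
    """Sample pairs of distinct audio clips for editing tasks.

    In each sampled pair, one clip is treated as the editing target
    and the other as the source. Illustrative sketch only.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible test set
    tasks = []
    for _ in range(n_tasks):
        target, source = rng.sample(clip_ids, 2)  # two distinct clips
        tasks.append({"target": target, "source": source})
    return tasks
```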
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using specific models like Tango, AudioLDM, and Make-An-Audio, but does not provide specific ancillary software details like library or framework versions (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | In this work, we primarily utilize Tango (Ghosal et al., 2023) as our TTA backbone model due to its success in TTA generation, while it's worth mentioning that our methods can be applied to a wide range of popular diffusion models. We run our experiments with 100 inference steps and retain the original hyperparameters from Tango. For editing, we run the denoising steps with 0.8 cross-replace steps, 0.0 self-replace steps, and 50 skip steps. The bootstrapping num n is set to 5. We reset our Fuser configs to fit these settings, mainly η_min and η_max, t_s, and t_e.