Prompt-guided Precise Audio Editing with Diffusion Models
Authors: Manjie Xu, Chenxing Li, Duzhen Zhang, Dan Su, Wei Liang, Dong Yu
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From Section 4 (Experiments): Experimental results highlight the effectiveness of our method in various editing tasks. We construct our test set utilizing a cleaned subset of the FSD50K dataset (Fonseca et al., 2021; Li et al., 2023). For objective metrics, we leverage commonly used metrics to evaluate the editing effects. We leverage Fréchet distance (FD), Fréchet audio distance (FAD), Spectral distance (SD), and Kullback-Leibler (KL) divergence to measure the distance between the edited audio and the ground truth. (A minimal sketch of the Fréchet distance underlying FD/FAD follows this table.) |
| Researcher Affiliation | Collaboration | Work done while Manjie Xu and Duzhen Zhang were interns at Tencent. ¹Beijing Institute of Technology, ²Tencent AI Lab Beijing, ³Tencent AI Lab Seattle. Correspondence to: Chenxing Li <lichenxing007@gmail.com>, Wei Liang <liangwei@bit.edu.cn>, Dong Yu <dongyu@ieee.org>. |
| Pseudocode | Yes | Algorithm 1 PPAE |
| Open Source Code | Yes | See the project page at https://sites.google.com/view/icml24-ppae. |
| Open Datasets | Yes | We construct our test set utilizing a cleaned subset of the FSD50K dataset (Fonseca et al., 2021; Li et al., 2023). |
| Dataset Splits | No | We construct our test set utilizing a cleaned subset of the FSD50K dataset (Fonseca et al., 2021; Li et al., 2023). A pivotal aspect of precise audio editing is implementing precise modifications while maintaining the other elements of the audio unchanged. In each task, we select two distinct audio clips, treating one as the target for editing. For each task, we randomly sample 100 editing tasks as the test set. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using specific models like Tango, AudioLDM, and Make-An-Audio, but does not provide specific ancillary software details like library or framework versions (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In this work, we primarily utilize Tango (Ghosal et al., 2023) as our TTA backbone model due to its success in TTA generation, while it's worth mentioning that our methods can be applied to a wide range of popular diffusion models. We run our experiments with 100 inference steps and retain the original hyperparameters from Tango. For editing, we run the denoising steps with 0.8 cross-replace steps, 0.0 self-replace steps, and 50 skip steps. The bootstrapping number n is set to 5. We reset our Fuser configs to fit these settings, mainly η_min, η_max, t_s, and t_e. (A hypothetical sketch wiring up these values follows this table.) |
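
The FD and FAD metrics quoted in the Research Type row both reduce to the same quantity: a Fréchet distance between two Gaussians fit to embedding sets (the two metrics typically differ only in the embedding model used, e.g. PANNs vs. VGGish). The sketch below is a minimal illustration of that computation, assuming `real_emb` and `edit_emb` are pre-extracted (N, D) embedding matrices; embedding extraction itself is out of scope and this is not the paper's evaluation code.

```python
# Minimal sketch of the Frechet distance behind FD/FAD-style metrics.
# Assumes `real_emb` and `edit_emb` are (N, D) numpy arrays of audio
# embeddings already extracted by some embedding model (not shown here).
import numpy as np
from scipy import linalg

def frechet_distance(real_emb: np.ndarray, edit_emb: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to the two embedding sets."""
    mu1, mu2 = real_emb.mean(axis=0), edit_emb.mean(axis=0)
    sigma1 = np.cov(real_emb, rowvar=False)
    sigma2 = np.cov(edit_emb, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the covariance product; discard the tiny
    # imaginary residue that numerical error can introduce.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```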
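The hyperparameters in the Experiment Setup row follow a prompt-to-prompt-style attention-injection schedule. The sketch below shows one plausible wiring of the reported values (100 inference steps, 0.8 cross-replace, 0.0 self-replace, 50 skip steps, n = 5); every name in it (`EditConfig`, the two predicate functions) is hypothetical and is not Tango's or PPAE's actual API.

```python
# Hypothetical wiring of the reported editing hyperparameters.
# All names below are illustrative, not the paper's actual interface.
from dataclasses import dataclass

@dataclass
class EditConfig:
    num_inference_steps: int = 100    # total denoising steps
    cross_replace_steps: float = 0.8  # fraction of steps injecting source cross-attn
    self_replace_steps: float = 0.0   # fraction of steps injecting source self-attn
    skip_steps: int = 50              # early steps skipped (start from a noised source latent)
    bootstrap_n: int = 5              # bootstrapping number n

def use_source_cross_attention(step: int, cfg: EditConfig) -> bool:
    """Reuse the source prompt's cross-attention maps during the first
    80% of denoising so unedited content stays anchored to the source."""
    return step / cfg.num_inference_steps < cfg.cross_replace_steps

def use_source_self_attention(step: int, cfg: EditConfig) -> bool:
    # 0.0 means self-attention is never replaced in this configuration.
    return step / cfg.num_inference_steps < cfg.self_replace_steps
```

Under this reading, the 50 skip steps correspond to starting denoising halfway through the schedule from a partially noised source latent, which is a common way to trade edit strength against fidelity to the original audio.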