Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
Authors: Hila Manor, Tomer Michaeli
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we explore two approaches for zero-shot audio editing with pre-trained audio DDMs, one based on text guidance and the other based on semantic perturbations that are found in an unsupervised manner. Our zero-shot text-guided audio (ZETA) editing technique allows a wide range of manipulations, from changing the style or genre of a musical piece to changing specific instruments in the arrangement (Fig. 1(c),(d)), all while maintaining high perceptual quality and semantic similarity to the source signal. Our zero-shot unsupervised (ZEUS) technique allows generating, e.g., interesting variations in melody that adhere to the original key, rhythm, and style, but are impossible to achieve through text guidance (Fig. 1(a),(b)). We compare our methods to the state-of-the-art text-to-music model MusicGen (Copet et al., 2023), whose generation process can be conditioned on a given music piece, as well as to using the zero-shot editing methods SDEdit (Meng et al., 2021) and DDIM inversion (Song et al., 2021b; Dhariwal & Nichol, 2021) in conjunction with the AudioLDM2 model (Liu et al., 2023b). |
| Researcher Affiliation | Academia | Faculty of Electrical and Computer Engineering, Technion - Israel Institute of Technology, Haifa, Israel. Correspondence to: Hila Manor <hila.manor@campus.technion.ac.il>. |
| Pseudocode | Yes | In addition to publishing the code repository, we provide in Alg. 1 the complete algorithm for the unsupervised PC computation described here and in Sec. 3.3 for reference. Algorithm 1: Unsupervised PCs Computation (a standalone subspace-iteration sketch is given after the table). |
| Open Source Code | Yes | Samples and code can be found on our examples page. |
| Open Datasets | Yes | To enable a systematic analysis and quantitative comparison to other editing methods, we use the Music Delta subset of the MedleyDB dataset (Bittner et al., 2014), comprised of 34 musical excerpts in varying styles and in lengths ranging from 20 seconds to 5 minutes, and create and release with our code base a corresponding small dataset of prompts, named MedleyMDPrompts. This prompts dataset includes 3-4 source prompts for each signal, and 3-12 editing target prompts for each of the source prompts, totalling 107 source prompts and 696 target prompts, all labeled manually by the authors. |
| Dataset Splits | No | The paper mentions sub-sampling and datasets but does not explicitly define specific training, validation, or test splits by percentages or exact counts for reproducibility. |
| Hardware Specification | No | The paper discusses inference speed and memory consumption on a GPU, but does not specify any particular GPU model, CPU, or other hardware specifications used for experiments. |
| Software Dependencies | Yes | For the CLAP model used in the CLAP, LPAPS, and FAD metrics calculation, as described in Sec. 4.2, we follow Gui et al. (2024) and MusicGen (Copet et al., 2023), and use the music_audioset_epoch_15_esc_90.14.pt checkpoint of LAION-AI (Chen et al., 2022; Wu et al., 2023). In all of our unsupervised editing experiments, we run 50 subspace iterations for extracting PCs, and set C = 10⁻³ as the small approximation constant as described by Manor & Michaeli (2024). We use MusicGen with their default parameters provided in their official implementation demo. (A checkpoint-loading sketch is given after the table.) |
| Experiment Setup | Yes | To evaluate our editing methods we used AudioLDM2 (Liu et al., 2023b) as the pre-trained model, using 200 inference steps as recommended by the authors. In this setting we set the CFG strength of the target prompt to 12 for SDEdit and for our method, and to 5 for DDIM inversion. We set this hyper-parameter such that the edits achieve a good balance between CLAP and LPAPS. The CFG strength for the source prompt is set to 3, as recommended by Liu et al. (2023a). (A pipeline-setup sketch is given after the table.) |
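
For context on the unsupervised (ZEUS) branch referenced in the Pseudocode row, the sketch below illustrates the kind of subspace iteration described there: repeatedly applying finite-difference Jacobian-vector products of a denoiser to an orthonormal basis in order to approximate the top principal components of the posterior covariance, in the spirit of Manor & Michaeli (2024). The `denoise` callable, its signature, and the returned eigenvalue estimates are illustrative assumptions; this is not the authors' released Alg. 1.

```python
import torch

def unsupervised_pcs(denoise, x_t, num_pcs=3, num_iters=50, c=1e-3, generator=None):
    """Hedged sketch of subspace iteration for the top principal components (PCs)
    of the denoiser's posterior covariance at a noisy latent x_t.

    `denoise` is assumed to map a noisy latent to a posterior-mean estimate of
    the clean signal (an MMSE denoiser); the wrapper and its signature are
    illustrative, not the paper's exact interface.
    """
    d = x_t.numel()
    # Random orthonormal initialization of the subspace.
    v = torch.randn(d, num_pcs, device=x_t.device, generator=generator)
    v, _ = torch.linalg.qr(v)

    with torch.no_grad():
        mu = denoise(x_t)  # posterior-mean estimate E[x0 | x_t]
        for _ in range(num_iters):
            # Jacobian-vector products via forward finite differences:
            # J v_i ≈ (D(x_t + c * v_i) - D(x_t)) / c, where the denoiser
            # Jacobian J is proportional to the posterior covariance.
            jv = []
            for i in range(num_pcs):
                pert = (c * v[:, i]).reshape(x_t.shape)
                jv.append(((denoise(x_t + pert) - mu) / c).reshape(-1))
            jv = torch.stack(jv, dim=1)
            # Re-orthonormalize to keep the subspace well conditioned.
            v, r = torch.linalg.qr(jv)

    # Rough eigenvalue estimates from the final upper-triangular QR factor.
    eigvals = torch.abs(torch.diagonal(r))
    return v, eigvals
```

In the paper's editing pipeline, such PCs are then added (scaled by a user-chosen strength) to the latents along the DDPM-inversion trajectory; the sketch only covers the extraction step.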
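
The Software Dependencies row names a specific LAION-CLAP checkpoint. A minimal loading sketch, assuming the `laion_clap` Python package; the `HTSAT-base` audio tower, the file paths, and the prompt are assumptions for illustration, not details taken from the paper.

```python
import torch
import laion_clap

# Load the LAION-CLAP music checkpoint named in the paper for
# CLAP-score-style evaluation (paths and prompt are placeholders).
model = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')
model.load_ckpt('music_audioset_epoch_15_esc_90.14.pt')

# Embed an edited clip and its target prompt, then take cosine similarity.
audio_emb = model.get_audio_embedding_from_filelist(x=['edited.wav'], use_tensor=True)
text_emb = model.get_text_embedding(['a jazz arrangement with saxophone'], use_tensor=True)
clap_score = torch.nn.functional.cosine_similarity(audio_emb, text_emb).item()
print(f"CLAP similarity: {clap_score:.3f}")
```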
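
The Experiment Setup row fixes the backbone (AudioLDM2), the number of inference steps (200), and the CFG strengths. The sketch below instantiates that backbone through the Hugging Face `diffusers` pipeline with those values; it performs plain text-conditioned sampling only and omits the DDPM inversion of the source signal (and its source-prompt CFG of 3) that the paper's ZETA editing adds on top. The `cvssp/audioldm2` checkpoint id and the prompt are assumptions.

```python
import torch
from diffusers import AudioLDM2Pipeline

# Hedged sketch: AudioLDM2 sampling with the hyper-parameters quoted above.
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

audio = pipe(
    prompt="a calm piano piece rearranged for a string quartet",  # placeholder prompt
    num_inference_steps=200,   # steps recommended by the AudioLDM2 authors
    guidance_scale=12.0,       # target-prompt CFG strength used for ZETA and SDEdit
    audio_length_in_s=10.0,
).audios[0]
```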