AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
Authors: Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution). |
| Researcher Affiliation | Collaboration | Yuancheng Wang¹,², Zeqian Ju¹, Xu Tan¹, Lei He¹, Zhizheng Wu², Jiang Bian¹, Sheng Zhao¹ (¹Microsoft, ²The Chinese University of Hong Kong, Shenzhen) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Demo samples are available at https://audit-demopage.github.io/. This is a demo page rather than a link to the source code for the methodology, and the paper makes no explicit statement of a code release. |
| Open Datasets | Yes | The datasets used in our work consist of AudioCaps [22], AudioSet [12], FSD50K [11], and ESC50 [41]. |
| Dataset Splits | No | The paper states "We use a total of about 0.6M triplet data to train our audio editing model." but does not provide specific training/validation/test split information, nor does it explicitly mention a validation set that would be needed for reproduction. |
| Hardware Specification | Yes | Our models are trained on 8 NVIDIA V100 GPUs for 500K steps with a batch size of 2 on each device. |
| Software Dependencies | No | The paper does not provide specific version numbers for ancillary software components or libraries (e.g., PyTorch, Python, CUDA versions). |
| Experiment Setup | Yes | We train our autoencoder model with a batch size of 32 (8 per device) on 8 NVIDIA V100 GPUs for a total of 50000 steps with a learning rate of 7.5e-5. For both audio editing and U-Net audio generative diffusion, we train with a batch size of 8 on 8 NVIDIA V100 GPUs for a total of 500000 steps with a learning rate of 5e-5. Both the autoencoder and diffusion models use AdamW [29] as the optimizer with (β1, β2) = (0.9, 0.999) and weight decay of 1e-2. (A hedged sketch of this optimizer configuration appears below the table.) |
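The Experiment Setup row quotes concrete optimizer hyperparameters (AdamW, (β1, β2) = (0.9, 0.999), weight decay 1e-2, learning rates 7.5e-5 and 5e-5). The sketch below shows, in PyTorch, how those reported values map onto an optimizer configuration. The `TinyPlaceholderModel` and the `stage` argument are hypothetical stand-ins, not the authors' code; only the numeric hyperparameters come from the paper.

```python
# Minimal sketch of the reported optimizer configuration, assuming PyTorch.
# Only the hyperparameter values are taken from the paper; the model and
# training loop are placeholders for illustration.
import torch


class TinyPlaceholderModel(torch.nn.Module):
    """Hypothetical stand-in for the autoencoder / latent-diffusion U-Net."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16, 16)

    def forward(self, x):
        return self.net(x)


def make_optimizer(model: torch.nn.Module, stage: str) -> torch.optim.AdamW:
    # Learning rates as quoted: 7.5e-5 for the autoencoder (50K steps),
    # 5e-5 for the editing/generation diffusion models (500K steps).
    lr = 7.5e-5 if stage == "autoencoder" else 5e-5
    return torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        betas=(0.9, 0.999),  # (beta1, beta2) as reported
        weight_decay=1e-2,   # weight decay as reported
    )


if __name__ == "__main__":
    model = TinyPlaceholderModel()
    opt = make_optimizer(model, stage="diffusion")
    # One dummy step to show the optimizer is wired up; the real objective
    # (diffusion / reconstruction loss) is not reproduced here.
    x = torch.randn(8, 16)  # batch size 8 per the diffusion setup
    loss = model(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```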