AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models

Authors: Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution). |
| Researcher Affiliation | Collaboration | Yuancheng Wang (1, 2), Zeqian Ju (1), Xu Tan (1), Lei He (1), Zhizheng Wu (2), Jiang Bian (1), Sheng Zhao (1). Affiliations: (1) Microsoft; (2) The Chinese University of Hong Kong, Shenzhen. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Demo samples are available at https://audit-demopage.github.io/. This is a demo page, not a link to source code for the methodology, and the paper makes no explicit statement of code release. |
| Open Datasets | Yes | The datasets used in our work consist of AudioCaps [22], AudioSet [12], FSD50K [11], and ESC50 [41]. |
| Dataset Splits | No | The paper states "We use a total of about 0.6M triplet data to train our audio editing model." but does not provide training/validation/test split information or explicitly mention a validation set for reproduction. |
| Hardware Specification | Yes | Our models are trained on 8 NVIDIA V100 GPUs for 500K steps with a batch size of 2 on each device. |
| Software Dependencies | No | The paper does not provide version numbers for ancillary software components or libraries (e.g., PyTorch, Python, CUDA versions). |
| Experiment Setup | Yes | We train our autoencoder model with a batch size of 32 (8 per device) on 8 NVIDIA V100 GPUs for a total of 50000 steps with a learning rate of 7.5e-5. For both audio editing and U-Net audio generative diffusion, we train with a batch size of 8 on 8 NVIDIA V100 GPUs for a total of 500000 steps with a learning rate of 5e-5. Both the autoencoder and diffusion models use AdamW [29] as the optimizer with (β1, β2) = (0.9, 0.999) and weight decay of 1e-2. |
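
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code: the class and function names (`TrainConfig`, `samples_seen`, `approx_epochs`) are hypothetical. It assumes the quoted batch sizes (32 and 8) are global batch sizes, and that the learning rates and weight decay are 7.5e-5, 5e-5, and 1e-2 (the exponent signs are read as negative). The epoch estimate further assumes the "about 0.6M" triplets are exactly 600,000 examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    name: str
    global_batch_size: int  # assumption: the quoted batch size is the global one
    steps: int
    learning_rate: float
    betas: tuple = (0.9, 0.999)   # AdamW (beta1, beta2) as quoted
    weight_decay: float = 1e-2    # AdamW weight decay as quoted

    def samples_seen(self) -> int:
        """Total training examples processed: batch size times number of steps."""
        return self.global_batch_size * self.steps


# Values taken from the quoted setup; exponent signs are an assumption.
AUTOENCODER = TrainConfig("autoencoder", 32, 50_000, 7.5e-5)
DIFFUSION = TrainConfig("diffusion", 8, 500_000, 5e-5)

# "about 0.6M triplet data" for the editing model; treated here as exactly 600,000.
TRIPLETS = 600_000


def approx_epochs(cfg: TrainConfig, dataset_size: int) -> float:
    """Rough number of passes over a dataset of the given size."""
    return cfg.samples_seen() / dataset_size


if __name__ == "__main__":
    for cfg in (AUTOENCODER, DIFFUSION):
        print(f"{cfg.name}: {cfg.samples_seen():,} samples seen")
    print(f"editing model: ~{approx_epochs(DIFFUSION, TRIPLETS):.2f} epochs over the triplets")
```

Under these assumptions the autoencoder processes 1.6M examples and the diffusion model 4M, or roughly 6.7 passes over the triplet data. If the per-device batch size of 2 from the Hardware Specification row is used instead (a global batch of 16 on 8 GPUs), the diffusion figures double.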