Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models
Authors: Yuancheng Wang, Zeqian Ju, Xu Tan, Lei He, Zhizheng Wu, Jiang Bian, Sheng Zhao
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | AUDIT achieves state-of-the-art results in both objective and subjective metrics for several audio editing tasks (e.g., adding, dropping, replacement, inpainting, super-resolution). |
| Researcher Affiliation | Collaboration | Yuancheng Wang¹,², Zeqian Ju¹, Xu Tan¹, Lei He¹, Zhizheng Wu², Jiang Bian¹, Sheng Zhao¹; ¹Microsoft, ²The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Demo samples are available at https://audit-demopage.github.io/. This is a demo page and not a direct link to the source code for the methodology or an explicit statement of code release. |
| Open Datasets | Yes | The datasets used in our work consist of AudioCaps [22], AudioSet [12], FSD50K [11], and ESC50 [41]. |
| Dataset Splits | No | The paper states "We use a total of about 0.6M triplet data to train our audio editing model." but does not provide specific training/validation/test split information or explicitly mention a validation set for reproduction. |
| Hardware Specification | Yes | Our models are trained on 8 NVIDIA V100 GPUs for 500K steps with a batch size of 2 on each device. |
| Software Dependencies | No | The paper does not provide specific version numbers for ancillary software components or libraries (e.g., PyTorch, Python, CUDA versions). |
| Experiment Setup | Yes | We train our autoencoder model with a batch size of 32 (8 per device) on 8 NVIDIA V100 GPUs for a total of 50000 steps with a learning rate of 7.5e-5. For both audio editing and U-Net audio generative diffusion, we train with a batch size of 8 on 8 NVIDIA V100 GPUs for a total of 500000 steps with a learning rate of 5e-5. Both the autoencoder and diffusion models use AdamW [29] as the optimizer with (β1, β2) = (0.9, 0.999) and weight decay of 1e-2. |
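
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration record for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the names `TrainConfig`, `AUTOENCODER`, `DIFFUSION`, and `adamw_kwargs` are hypothetical, and the batch sizes are recorded exactly as quoted, without deriving a per-device split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical names)."""
    name: str
    batch_size: int            # as quoted; the per-device split is not derived here
    total_steps: int
    learning_rate: float
    optimizer: str = "AdamW"
    betas: tuple = (0.9, 0.999)
    weight_decay: float = 1e-2
    num_gpus: int = 8          # 8 NVIDIA V100 GPUs


# Values copied from the quoted setup; nothing here is inferred.
AUTOENCODER = TrainConfig("autoencoder", batch_size=32,
                          total_steps=50_000, learning_rate=7.5e-5)
DIFFUSION = TrainConfig("diffusion", batch_size=8,
                        total_steps=500_000, learning_rate=5e-5)


def adamw_kwargs(cfg: TrainConfig) -> dict:
    """Return optimizer settings as keyword arguments (lr, betas, weight_decay)."""
    return {
        "lr": cfg.learning_rate,
        "betas": cfg.betas,
        "weight_decay": cfg.weight_decay,
    }


if __name__ == "__main__":
    for cfg in (AUTOENCODER, DIFFUSION):
        print(cfg.name, cfg.total_steps, adamw_kwargs(cfg))
```

The dictionary returned by `adamw_kwargs` uses the keyword names accepted by PyTorch's `torch.optim.AdamW`, so it could be passed along with a model's parameters, assuming a PyTorch-based reimplementation. The report notes that the paper does not specify software versions, so the choice of framework is itself an assumption.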