Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Authors: Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study.
Researcher Affiliation	Academia	Michael Ungersböck (ETH Zurich), Florian Grötschla (ETH Zurich), Luca A. Lanzendörfer (ETH Zurich), June Young Yi (Seoul National University), Changho Choi (Korea University), Roger Wattenhofer (ETH Zurich)
Pseudocode	No	The paper describes methods and pipelines in figures and text but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	To encourage future research, we release our code and model weights. ... We provide all our code and model weights online: https://github.com/eth-disco/sao-instruct.
Open Datasets	Yes	For the input captions, we use the captioning datasets AudioCaps [22] and WavCaps [35]. AudioCaps contains 50k human-written captions paired with audio clips sourced from AudioSet [12]. WavCaps consists of 400k audio-caption pairs collected from multiple sources, including FreeSound, AudioSet-SL [15] and the BBC Sound Effects library. ... ZETA baseline [33] uses Stable Audio Open as the underlying generative model, which was trained on FreeSound and the Free Music Archive (FMA) [6]. ... AudioEditor [20], which uses Auffusion [46] as its underlying model trained on several audio datasets, including AudioCaps [22], WavCaps [35], and MACS [34].
Dataset Splits	Yes	For Prompt-to-Prompt, input captions are taken from the AudioCaps [22] training split and a random subset of FreeSound from WavCaps [35]. ... We evaluate on 1k 10-second samples from the AudioCaps test subset, where edit instructions and output captions are generated using the approach described in Section 3.1. ... To improve prompt adherence of Stable Audio Open, we fine-tune the model on both the training and validation splits of AudioCaps [22] using a total of 47k samples.
Hardware Specification	Yes	While the dataset was generated using various GPUs, we report the average inference times based on a single NVIDIA A6000 GPU for consistency. ... The model is trained for 15 epochs with a batch size of 64 on two NVIDIA A100 GPUs for 30 hours. ... We train the diffusion transformer on two NVIDIA A6000 GPUs with a batch size of 16 for 4 epochs.
Software Dependencies	No	The paper mentions specific LLMs such as GPT-4o and GPT-4.1 mini, and the AdamW optimizer, but does not provide version numbers for general software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup	Yes	Specifically, the AdamW [31] optimizer is configured with a learning rate of 5e-5, (β1, β2) = (0.9, 0.999), and weight decay of 1e-3. ... The model is trained for 15 epochs with a batch size of 64 on two NVIDIA A100 GPUs for 30 hours. ... Unless specified otherwise, we use 100 denoising steps and a CFG value of 5. ... We train the diffusion transformer on two NVIDIA A6000 GPUs with a batch size of 16 for 4 epochs. Models trained on the individual datasets (50k samples each) were trained for 30 hours, while the final model on the large combined dataset (150k samples) was trained for 80 hours.
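
The Experiment Setup values quoted in the table can be collected into a small configuration sketch, which also allows a rough estimate of the number of optimizer steps behind the diffusion transformer runs. This is an illustration only, not the authors' code: the names `ADAMW_SETTINGS`, `INFERENCE_DEFAULTS`, `RunSummary`, and `optimizer_steps` are hypothetical, the learning rate and weight decay are taken to be 5e-5 and 1e-3, and the step count assumes that "50k" and "150k" mean exactly 50,000 and 150,000 samples, that the batch size of 16 is the global batch size, and that the 4-epoch schedule applies to both the individual and the combined datasets. The quoted text confirms none of these assumptions.

```python
import math
from dataclasses import dataclass

# Optimizer settings as quoted in the Experiment Setup row.
# Assumption: the exponents are negative (5e-5 and 1e-3).
ADAMW_SETTINGS = {"lr": 5e-5, "betas": (0.9, 0.999), "weight_decay": 1e-3}

# Inference defaults quoted in the same row ("unless specified otherwise").
INFERENCE_DEFAULTS = {"denoising_steps": 100, "cfg_scale": 5}


@dataclass(frozen=True)
class RunSummary:
    """One diffusion transformer training run, as summarized in the table."""
    label: str
    num_samples: int       # assumed exact; the table only says "50k" / "150k"
    batch_size: int        # assumed to be the global batch size
    epochs: int            # assumed identical for all runs
    wall_clock_hours: int  # reported training time


def optimizer_steps(run: RunSummary) -> int:
    """Estimate optimizer steps, keeping the last partial batch of each epoch."""
    return math.ceil(run.num_samples / run.batch_size) * run.epochs


RUNS = [
    RunSummary("individual dataset", 50_000, 16, 4, 30),
    RunSummary("combined dataset", 150_000, 16, 4, 80),
]

if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.label}: ~{optimizer_steps(run)} steps in {run.wall_clock_hours} h")
```

Under these assumptions the individual-dataset runs come to about 12,500 steps and the combined run to about 37,500. The 15-epoch, batch-size-64 run is left out because the quoted text does not tie a sample count to it.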