Fast Timing-Conditioned Latent Audio Diffusion

Authors: Zach Evans, CJ Carr, Josiah Taylor, Scott H. Hawley, Jordi Pons

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose: (i) a Fréchet Distance based on OpenL3 embeddings (Cramer et al., 2019) to evaluate the plausibility of the generated long-form full-band stereo signals, (ii) a Kullback-Leibler divergence to evaluate the semantic correspondence between lengthy generated and reference audios up to 32 kHz, and (iii) a CLAP score to evaluate how long-form full-band stereo audios adhere to the given text prompt. We also conduct a qualitative study, assessing audio quality and text alignment, while also pioneering the assessment of musicality, stereo correctness, and musical structure. We show that Stable Audio can obtain state-of-the-art results on long-form full-band stereo music and sound effects generation from text and timing inputs.
Researcher Affiliation | Collaboration | Stability AI; Belmont University (work done while at Stability AI).
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code to reproduce our model/metrics and demos is online. Model: https://github.com/Stability-AI/stable-audio-tools. Metrics: https://github.com/Stability-AI/stable-audio-metrics.
Open Datasets | Yes | Our dataset consists of 806,284 audios (19,500 hours) containing music (66% or 94%), sound effects (25% or 5%), and instrument stems (9% or 1%), with the corresponding text metadata from the stock music provider AudioSparx. Our dataset (audio and metadata) is online for consultation at https://www.audiosparx.com. [...] We rely on the standard MusicCaps (Agostinelli et al., 2023) and AudioCaps (Kim et al., 2019) benchmarks.
Dataset Splits | No | The paper mentions a 'training window length' and evaluates on standard benchmarks (the MusicCaps and AudioCaps test sets), but it does not specify percentages or sample counts for training, validation, and test splits of its main dataset, nor does it describe cross-validation or any other splitting methodology beyond citing those standard benchmarks for evaluation.
Hardware Specification | Yes | It can render up to 95 sec (our training window length) of stereo audio at 44.1 kHz in 8 sec on an A100 GPU (40GB VRAM). [...] It was trained using automatic mixed precision for 1.1M steps with an effective batch size of 256 on 16 A100 GPUs.
Software Dependencies | No | The paper describes the architectural components and methods used (e.g., DPMSolver++, a fast and memory-efficient attention implementation, CLAP) and links to its model code and a dependent CLAP repository, but it does not list specific software libraries with version numbers (e.g., Python 3.x, PyTorch 1.x) required for full reproducibility.
Experiment Setup | Yes | It was trained using automatic mixed precision for 1.1M steps with an effective batch size of 256 on 16 A100 GPUs. After 460,000 steps the encoder was frozen and the decoder was fine-tuned for an additional 640,000 steps. [...] The CLAP model was trained for 100 epochs on our dataset from scratch, with an effective batch size of 6,144 with 64 A100 GPUs. [...] It was trained using exponential moving average and automatic mixed precision for 640,000 steps on 64 A100 GPUs with an effective batch size of 256. The audio was resampled to 44.1 kHz and sliced to 4,194,304 samples (95.1 sec). [...] We implemented a v-objective (Salimans & Ho, 2022) with a cosine noise schedule and continuous denoising timesteps. We apply dropout (10%) to the conditioning signals to be able to use classifier-free guidance. The text encoder is frozen while training the diffusion model. [...] Our sampling strategy during inference is based on DPMSolver++ (Lu et al., 2022), and we use classifier-free guidance (with a scale of 6) as proposed by Lin et al. (2024). We use 100 diffusion steps during inference...
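
The evaluation protocol quoted in the Research Type row centers on a Fréchet Distance computed over OpenL3 embeddings of reference and generated audio. The authors' implementation lives in stable-audio-metrics; the snippet below is only a minimal sketch of the underlying statistic (the mean-and-covariance Fréchet distance between two embedding sets), with the embedding arrays assumed to be precomputed and any pooling over time frames left out.

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between two embedding sets of shape (n_samples, dim)."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)

    # Matrix square root of the covariance product; drop tiny imaginary residue.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

In the paper's setting the embeddings would come from an OpenL3 model applied to the long-form full-band stereo signals; stable-audio-metrics handles that feature-extraction step.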
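As a back-of-the-envelope check on the Open Datasets row, the quoted totals (806,284 audios, 19,500 hours) imply an average clip length of roughly 87 seconds; the lines below are just that arithmetic, not figures from the paper.

```python
n_clips = 806_284
total_hours = 19_500

avg_seconds = total_hours * 3600 / n_clips  # total duration in seconds per clip
print(f"average clip length ~= {avg_seconds:.1f} s")  # ~= 87.1 s
```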
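Similarly, the Hardware Specification row reports 95 seconds of 44.1 kHz stereo audio rendered in 8 seconds on an A100 (40 GB VRAM), which works out to roughly 12x faster than real time:

```python
audio_seconds = 95        # length of the rendered training window
wall_clock_seconds = 8    # reported generation time on an A100 (40 GB)

real_time_factor = audio_seconds / wall_clock_seconds
print(f"~{real_time_factor:.1f}x faster than real time")  # ~11.9x
```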
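The Experiment Setup row quotes the diffusion choices: a v-objective with a cosine noise schedule and continuous timesteps, 10% conditioning dropout, classifier-free guidance at scale 6, and DPMSolver++ sampling with 100 steps. The sketch below illustrates only the v-objective target and a plain classifier-free-guidance blend under those stated hyperparameters; it is not the authors' training code, the tensor shapes are assumed, and the paper's guidance variant (following Lin et al., 2024) may differ in detail.

```python
import math
import torch

def v_target(x0: torch.Tensor, noise: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """v-objective target (Salimans & Ho, 2022) under a cosine schedule.

    t holds continuous timesteps in [0, 1]; alpha_t = cos(t*pi/2), sigma_t = sin(t*pi/2).
    x0 and noise are assumed to be latents of shape (batch, channels, length).
    """
    alpha = torch.cos(t * math.pi / 2).view(-1, 1, 1)
    sigma = torch.sin(t * math.pi / 2).view(-1, 1, 1)
    return alpha * noise - sigma * x0

def cfg_prediction(model, x_t, t, cond, uncond, scale: float = 6.0):
    """Classifier-free guidance: pred_uncond + scale * (pred_cond - pred_uncond)."""
    pred_cond = model(x_t, t, cond)
    pred_uncond = model(x_t, t, uncond)
    return pred_uncond + scale * (pred_cond - pred_uncond)
```

A sampler such as DPMSolver++ would call a guidance blend like `cfg_prediction` at each of the 100 inference steps; the 10% conditioning dropout during training is what makes the unconditional branch available at sampling time.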