AudioLDM: Text-to-Audio Generation with Latent Diffusion Models

Authors: Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance compared to other open-sourced systems, measured by both objective and subjective metrics. We show the main evaluation results on the AC test set in Table 1.
Researcher Affiliation | Academia | 1 Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK; 2 Department of Electrical and Electronic Engineering, Imperial College London, London, UK. Correspondence to: Haohe Liu <haohe.liu@surrey.ac.uk>.
Pseudocode | No | The paper describes processes and equations but does not include any formally labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our implementation and demos are available at https://audioldm.github.io.
Open Datasets | Yes | Training dataset: The datasets we used in this paper include AudioSet (AS) (Gemmeke et al., 2017), AudioCaps (AC) (Kim et al., 2019), Freesound (FS), and the BBC Sound Effects library (SFX).
Dataset Splits | No | The paper states the total number of audio samples used for training and describes how the evaluation sets were generated (e.g., 'randomly selecting one of them as text condition' from AC and 'randomly select 10% audio samples from AS as another evaluation set'), but it does not give specific percentages or absolute counts for the training, validation, and test splits of the overall experimental setup (a hedged sketch of the quoted selection procedures follows the table).
Hardware Specification | Yes | Then, we train AudioLDM-S and AudioLDM-L for 0.6M steps on a single GPU, an NVIDIA RTX 3090, with batch sizes of 5 and 8, respectively. AudioLDM-L-Full is trained for 1.5M steps on one NVIDIA A100 with a batch size of 8.
Software Dependencies | No | The paper mentions using components such as HiFi-GAN and the Adam optimizer, but it does not specify version numbers for these or other software dependencies required for replication.
Experiment Setup | Yes | The paper provides a detailed experimental setup, including the compression level r = 4, training steps (0.6M, 1.5M, 0.25M), batch sizes (5, 8, 96), learning rates (3e-5, 1e-5, 4.5e-6, 2e-4), noise-schedule parameters (N = 1000, beta_1 = 0.0015, beta_N = 0.0195), DDIM sampling steps (200), guidance scale (w = 2.0), the VAE optimizer (Adam) with its learning rate and batch size, and vocoder settings (window, FFT and hop sizes, fmin, fmax, AdamW parameters, learning-rate decay). These values are collected in a hedged configuration sketch below the table.
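The Dataset Splits row quotes two selection procedures without giving counts: one AudioCaps caption is drawn per clip as the text condition, and 10% of AudioSet is sampled as an extra evaluation set. Purely as a hedged illustration of those two quoted steps, the snippet below draws such a selection; the clip-ID lists, caption table, and random seed are placeholders, not details taken from the paper.

```python
import random

# Placeholder inputs; the paper does not publish these lists or a seed.
audioset_clip_ids = [f"as_{i:06d}" for i in range(100_000)]
audiocaps_captions = {
    "ac_000001": ["a dog barks in the distance", "dog barking outdoors"],
    "ac_000002": ["rain falls on a tin roof", "heavy rain and thunder"],
}

rng = random.Random(0)  # assumed seed; the paper does not state one

# "randomly select 10% audio samples from AS as another evaluation set"
as_eval_set = rng.sample(audioset_clip_ids, k=len(audioset_clip_ids) // 10)

# "randomly selecting one of them as text condition" for each AudioCaps clip
ac_text_conditions = {clip_id: rng.choice(captions)
                      for clip_id, captions in audiocaps_captions.items()}
```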
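To make the reported hyperparameters easier to scan, the sketch below gathers the values quoted in the Experiment Setup and Hardware Specification rows into one configuration and shows the linear noise schedule and classifier-free-guidance combination that such values conventionally parameterize. This is a minimal sketch under the assumption that AudioLDM follows the usual DDPM/DDIM conventions; the dictionary keys and the guided_eps helper are hypothetical names, not the authors' code.

```python
import torch

# Values quoted in the rows above; key names are illustrative, not from the paper's code.
config = {
    "compression_level_r": 4,
    "train_steps": {"AudioLDM-S": 600_000, "AudioLDM-L": 600_000, "AudioLDM-L-Full": 1_500_000},
    "batch_size": {"AudioLDM-S": 5, "AudioLDM-L": 8, "AudioLDM-L-Full": 8},
    "diffusion_steps_N": 1000,
    "beta_1": 0.0015,
    "beta_N": 0.0195,
    "ddim_steps": 200,
    "guidance_scale_w": 2.0,
    # Also reported: 0.25M training steps, batch size 96, and learning rates
    # 3e-5, 1e-5, 4.5e-6, 2e-4; the summary above does not restate which
    # component each of these belongs to, so they are left unassigned here.
}

# Linear beta schedule from beta_1 to beta_N over N steps (assumed; the common DDPM convention).
betas = torch.linspace(config["beta_1"], config["beta_N"], config["diffusion_steps_N"])
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def guided_eps(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w: float = 2.0) -> torch.Tensor:
    """Classifier-free guidance in its common form: push the unconditional noise
    estimate toward the text-conditional one with strength w (w = 2.0 above)."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```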