AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
Authors: Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D Plumbley
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance compared to other open-sourced systems, measured by both objective and subjective metrics. We show the main evaluation results on the AC test set in Table 1. |
| Researcher Affiliation | Academia | 1Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK 2Department of Electrical and Electronic Engineering, Imperial College London, London, UK. Correspondence to: Haohe Liu <haohe.liu@surrey.ac.uk>. |
| Pseudocode | No | The paper describes processes and equations but does not include any formally labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our implementation and demos are available at https://audioldm.github.io. |
| Open Datasets | Yes | Training dataset: The datasets we used in this paper includes AudioSet (AS) (Gemmeke et al., 2017), AudioCaps (AC) (Kim et al., 2019), Freesound (FS), and BBC Sound Effect library (SFX). |
| Dataset Splits | No | The paper states the total number of audio samples used for training and describes how evaluation sets were generated (e.g., 'randomly selecting one of them as text condition' from AC, 'randomly select 10% audio samples from AS as another evaluation set'), but it does not provide specific percentages or absolute counts for training, validation, and test splits for the overall experimental setup. |
| Hardware Specification | Yes | Then, we train AudioLDM-S and AudioLDM-L for 0.6M steps on a single GPU, NVIDIA RTX 3090, with batch sizes of 5 and 8, respectively. The AudioLDM-L-Full is trained for 1.5M steps on one NVIDIA A100 with a batch size of 8. |
| Software Dependencies | No | The paper mentions using components like HiFi-GAN and the Adam optimizer, but it does not specify version numbers for these or other software dependencies required for replication. |
| Experiment Setup | Yes | The paper provides detailed experimental setup including compression level r=4, training steps (0.6M, 1.5M, 0.25M), batch sizes (5, 8, 96), learning rates (3e-5, 1e-5, 4.5e-6, 2e-4), noise schedule parameters (N=1000, beta1=0.0015, betaN=0.0195), DDIM sampling steps (200), guidance scale (w=2.0), VAE optimizer (Adam) and its learning rate and batch size, and vocoder settings (window, FFT, hop size, fmin, fmax, AdamW parameters, learning rate decay). |
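
The diffusion hyperparameters quoted in the Experiment Setup row (N=1000 steps, beta1=0.0015, betaN=0.0195, guidance scale w=2.0) are enough to sketch the core sampling settings. The snippet below is a minimal, hypothetical PyTorch illustration of a linear beta schedule and classifier-free guidance at those reported values; the function and variable names are our own, and the actual AudioLDM implementation (https://audioldm.github.io) may differ in detail.

```python
import torch

# Linear beta schedule as reported: N = 1000, beta_1 = 0.0015, beta_N = 0.0195.
N = 1000
betas = torch.linspace(0.0015, 0.0195, N)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def q_sample(z0, t, noise):
    """Forward diffusion: noise the latent z0 at step t (assumes a 4-D latent)."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise

def guided_eps(model, z_t, t, cond_emb, uncond_emb, w=2.0):
    """Classifier-free guidance with the reported scale w = 2.0.

    Hypothetical interface: `model` predicts noise from the latent, timestep,
    and a conditioning embedding; the unconditional branch uses a null embedding.
    """
    eps_cond = model(z_t, t, cond_emb)
    eps_uncond = model(z_t, t, uncond_emb)
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At inference the paper reports 200 DDIM sampling steps; a DDIM sampler would call `guided_eps` at each of those steps while stepping through a 200-step subsequence of the 1000-step schedule above.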