Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ETTA: Elucidating the Design Space of Text-to-Audio Models

Authors: Sang-Gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions, a task that is more challenging than current benchmarks. Our experiments thoroughly evaluate our framework ETTA on benchmark datasets (AudioCaps and MusicCaps). We start with a systematic comparison to elucidate the design space of TTA in four major aspects: 1) training data, 2) training objectives, 3) architectural design and model sizes, and 4) sampling methods.
Researcher Affiliation Industry NVIDIA. Correspondence to: Sang-gil Lee <EMAIL>, Zhifeng Kong <EMAIL>, Rafael Valle <EMAIL>.
Pseudocode No The paper describes its methodology in prose and mathematical formulations in Appendix B, but does not contain any explicitly labeled pseudocode or algorithm blocks in the main text or appendices.
Open Source Code Yes Code: https://github.com/NVIDIA/elucidated-text-to-audio
Open Datasets Yes With this strategy, we are able to generate 1.35M high-quality captions using audio from AudioCaps (Kim et al., 2019), AudioSet (Gemmeke et al., 2017), VGGSound (Chen et al., 2020), WavCaps (Mei et al., 2024), and Laion-630K (Wu et al., 2023). We name our synthetic dataset AF-Synthetic.
Dataset Splits Yes Our experiments thoroughly evaluate our framework ETTA on benchmark datasets (AudioCaps and MusicCaps). We train ETTA on four different training datasets to assess TTA quality: AudioCaps (50K captions), AF-AudioSet (161K captions), TangoPromptBank (1.21M captions), and our AF-Synthetic (1.35M captions). We then fine-tune ETTA on the AudioCaps training set (FT-AC) for 50k and 100k additional steps. We use FDP, KLS, and CLM on MusicCaps as a summary (Full results in Tables 13 and 14, Appendix D). We report the results using the best combination according to Table 27.
Hardware Specification Yes We train all models using 8 A100 GPUs.
Software Dependencies Yes We conduct our experiments based on the stable-audio-tools library (https://github.com/Stability-AI/stable-audio-tools, commit id: 7311840), which provides the most recent practices in building TTA models. We use BF16 mixed-precision training (Micikevicius et al., 2017) and flash-attention 2 (Dao et al., 2022) to maximize training throughput.
Experiment Setup Yes We train the Audio-VAE using AdamW (Loshchilov, 2017) with a peak learning rate of 1.5 × 10⁻⁴ with exponential decay for 2.8M steps, with a total batch size of 64 with 1.5 seconds per sample. We train with full precision (FP32) to make the waveform compression model as accurate as possible. The latent dimension is 64 and the frame rate is 21.5 Hz. Next, we train a text-conditional latent generative model for TTA synthesis. The latent model can be either a diffusion model (Ho et al., 2020; Song et al., 2021; Salimans & Ho, 2022) or a flow matching model (Lipman et al., 2022; Tong et al., 2023). We parameterize our model using the Diffusion Transformer (DiT) (Peebles & Xie, 2023) architecture based on Evans et al. (2024c) and Lan et al. (2024), with 24 layers, 24 heads, and a width of 1536 as the default choices. We condition our model on the outputs of the T5-base (Raffel et al., 2020) text encoder, which outputs embeddings for variable-length text. Our final model is trained for 1M steps using AdamW with a peak learning rate of 10⁻⁴ with exponential decay and total batch size of 128 with 10 seconds per sample. For ablation studies, we train each model for 250k steps unless otherwise stated. We use BF16 mixed-precision training (Micikevicius et al., 2017) and flash-attention 2 (Dao et al., 2022) to maximize training throughput. For diffusion models, following Evans et al. (2024c) we use the dpmpp-3m-sde sampler and CFG scale w_cfg = 7. For OT-CFM models, we compare between Euler and 2nd-order Heun samplers and draw Pareto curves for each method with respect to the number of function evaluations (NFE) and CFG scale. After this extensive sweep, we choose Euler sampling with NFE = 100, w_cfg = 3.5 for main results, and w_cfg = 1 (no classifier-free guidance) for ablation studies unless otherwise stated.
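The sampling recipe quoted above (Euler integration of the flow-matching ODE with classifier-free guidance) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `v_cond` and `v_uncond` are hypothetical stand-ins for the model's forward passes with and without the text condition, and the guidance rule v = v_uncond + w_cfg · (v_cond − v_uncond) is the standard CFG formulation.

```python
def euler_cfg_sample(v_cond, v_uncond, x0, nfe=100, w_cfg=3.5, t0=0.0, t1=1.0):
    """Integrate dx/dt = v(x, t) from t0 to t1 with nfe uniform Euler steps.

    v_cond / v_uncond: callables (x, t) -> velocity, standing in for the
    model evaluated with and without the text condition (hypothetical API).
    w_cfg: classifier-free guidance scale; w_cfg = 1 reduces to v_cond only.
    """
    x = x0
    dt = (t1 - t0) / nfe
    t = t0
    for _ in range(nfe):
        vc = v_cond(x, t)
        vu = v_uncond(x, t)
        v = vu + w_cfg * (vc - vu)  # CFG-blended velocity
        x = x + dt * v              # Euler step
        t += dt
    return x

# Toy check: with v_cond = 1 and v_uncond = 0 (constant fields), the guided
# velocity is w_cfg everywhere, so x moves from 0 to w_cfg over t in [0, 1].
```

Note that each Euler step costs two model evaluations here (conditional and unconditional), which is why CFG roughly doubles the effective NFE budget; at w_cfg = 1 the unconditional pass is redundant and plain conditional sampling suffices, matching the ablation setting quoted above.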