Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Fugatto 1: Foundational Generative Audio Transformer Opus 1

Authors: Rafael Valle, Rohan Badlani, Zhifeng Kong, Sang-gil Lee, Arushi Goel, Sungwon Kim, João Santos, Shuqi Dai, Siddharth Gururani, Aya Aljafari, Alexander Liu, Kevin Shih, Ryan Prenger, Wei Ping, Chao-Han Huck Yang, Bryan Catanzaro

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations across a diverse set of tasks demonstrate that Fugatto performs competitively with specialized models, while Composable ART enhances its sonic palette and control over synthesis. Most notably, we highlight emergent tasks and properties that surface in our framework: sonic phenomena that transcend conventional audio generation, unlocking new creative possibilities.
Researcher Affiliation | Industry | NVIDIA: Rafael Valle, Rohan Badlani, Zhifeng Kong, Sang-gil Lee, Arushi Goel, Sungwon Kim, João Felipe Santos, Shuqi Dai, Siddharth Gururani, Aya Al Jafari, Alexander H. Liu, Kevin Shih, Ryan Prenger, Wei Ping, Chao-Han Huck Yang, Bryan Catanzaro
Pseudocode | Yes | Algorithm 1: Optimal Transport Conditional Flow Matching Loss (pseudo-algorithm)
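The paper's Algorithm 1 is an Optimal Transport Conditional Flow Matching loss. A minimal NumPy sketch of the general OT-CFM objective is below; the function names, the `sigma_min` value, and the `model(xt, t, cond)` signature are illustrative assumptions, not Fugatto's exact implementation:

```python
import numpy as np

def ot_cfm_loss(model, x1, cond=None, sigma_min=1e-4, rng=None):
    """Sketch of an OT-CFM objective: regress the model's vector field onto
    the constant velocity of a straight-line (optimal-transport) path from
    Gaussian noise x0 to data x1. Illustrative only."""
    rng = np.random.default_rng() if rng is None else rng
    x0 = rng.standard_normal(x1.shape)                       # noise endpoint
    t = rng.random((x1.shape[0],) + (1,) * (x1.ndim - 1))    # per-sample time
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1         # interpolant
    target = x1 - (1.0 - sigma_min) * x0                     # target velocity
    pred = model(xt, t, cond)                                # predicted field
    return float(np.mean((pred - target) ** 2))
```

With the straight-line path, the regression target is constant in `t`, which is what makes the OT variant simpler than diffusion-style score matching.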
Open Source Code | No | We plan to release our synthetic captions, instructions, and code to facilitate reproducible research.
Open Datasets | Yes | With these pillars established and leveraging open-source datasets, we are able to build a large text and audio dataset with at least 20 million rows, not including on-the-fly modifications to captions, instructions, and audio. Assuming each row refers to 10 seconds of audio, our dataset comprises at least 50,000 hours of audio. We provide a full list of datasets, tasks, and instructions in Appendices A.1.2, A.1.3, and A.1.4, respectively.
Dataset Splits | Yes | For TTS, we follow the evaluation in Wang et al. (2023), using the same transcripts as Eskimez et al. (2024), to evaluate our model's ability to perform speech synthesis given a transcript and a speech sample from an unseen speaker. ... Text-To-Audio (TTA): We showcase Fugatto's performance on traditional TTA benchmarks that measure a model's ability to synthesize general sounds (AudioCaps) and music (MusicCaps) following instructions provided in text. We use the metrics (FD, FAD, and IS) and data splits (train, test) used in Kong et al. (2024b).
Hardware Specification | Yes | During the first phase, Fugatto is trained on at least 32 NVIDIA A100 GPUs for approximately 1M iterations with template-based instructions and a subset of tasks.
Software Dependencies | No | The paper mentions several software components, such as the T5 tokenizer, Optimal Transport Conditional Flow Matching (OT-CFM), the AdamW optimizer, a G2P model, Praat, Pedalboard, and the BigVGAN V2 vocoder. However, specific version numbers are not provided for these dependencies, only citations to their original papers or general descriptions.
Experiment Setup | Yes | We provide a list of Fugatto's hyperparameters in Table 10. ... We use the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of 1e-4, annealing the learning rate to 1e-6 during the second phase. ... During inference, we generate mel-spectrograms using 50 function evaluations (100 in practice) with Heun's solver and a task-specific guidance scale γ.
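The "50 function evaluations, 100 in practice" detail follows from Heun's method being a second-order solver: each step calls the vector field twice. A minimal sketch of Heun integration with classifier-free guidance is below; the `guided` blending formula, argument names, and time grid are assumptions for illustration, not Fugatto's exact inference code:

```python
import numpy as np

def heun_sample(vf, x0, n_steps=50, guidance=1.0, cond=None, uncond=None):
    """Sketch of Heun's (predictor-corrector) ODE solver over t in [0, 1].
    vf(x, t, c) is the learned vector field; two evaluations per step means
    50 steps cost ~100 function evaluations. Illustrative only."""
    x = np.asarray(x0, dtype=float)
    ts = np.linspace(0.0, 1.0, n_steps + 1)

    def guided(x, t):
        # classifier-free guidance: push the unconditional prediction
        # toward the conditional one by the scale `guidance`
        u = vf(x, t, uncond)
        return u + guidance * (vf(x, t, cond) - u)

    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0
        k1 = guided(x, t0)              # predictor (Euler) slope
        k2 = guided(x + dt * k1, t1)    # corrector slope at the endpoint
        x = x + 0.5 * dt * (k1 + k2)    # averaged Heun update
    return x
```

For example, integrating the constant field dx/dt = 1 from x0 = 0 over [0, 1] returns 1 exactly, since Heun's method is exact for fields that are linear in t.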