Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Authors: Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation. Moreover, we present its controllability and generalization for X-to-Audio with No Modality Left Behind, for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input. Both subjective and objective evaluations demonstrate that Make-An-Audio achieves new state-of-the-art in text-to-audio with natural and controllable synthesis. Make-An-Audio exhibits superior audio quality and text-audio alignment faithfulness on the benchmark AudioCaption dataset and even generalizes well to the unsupervised Clotho dataset in a zero-shot fashion. |
| Researcher Affiliation | Collaboration | *Equal contribution. 1Zhejiang University, 2Peking University, 3Speech & Audio Team, ByteDance AI Lab. Correspondence to: Zhou Zhao <zhaozhou@zju.edu.cn>. |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly presented as figures or labeled sections in the paper. |
| Open Source Code | No | The paper provides a link for "Audio samples" but not for the source code of the methodology itself. "Audio samples are available at https://Make-An-Audio.github.io" |
| Open Datasets | Yes | We train on a combination of several datasets: AudioSet, BBC sound effects, Audiostock, AudioCaps-train, ESC-50, FSD50K, Free To Use Sounds, Sonniss Game Effects, WeSoundEffects, MACS, Epidemic Sound, UrbanSound8K, WavText5Ks, LibriSpeech, and Medley-solos-DB. For audios without natural language annotation, we apply the pseudo prompt enhancement to construct captions aligned well with the audio. Overall we have 3k hours with 1M audio-text pairs for training data. |
| Dataset Splits | Yes | For evaluating text-to-audio models (Yang et al., 2022; Kreuk et al., 2022), the AudioCaption validation set is adopted as the standard benchmark, which contains 494 samples with five human-annotated captions for each audio clip. For a more challenging zero-shot scenario, we also provide results on the Clotho (Drossos et al., 2020) validation set, which contains multiple audio events. |
| Hardware Specification | Yes | For our main experiments, we train a U-Net (Ronneberger et al., 2015) based text-conditional diffusion model, which is optimized using 18 NVIDIA V100 GPUs until 2M optimization steps. |
| Software Dependencies | No | The paper mentions using HiFi-GAN (Kong et al., 2020) (V1) but does not provide version numbers for other software components such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | The base learning rate is set to 0.005, and we scale it by the number of GPUs and the batch size following LDM. We utilize HiFi-GAN (Kong et al., 2020) (V1) trained on the VGGSound dataset (Chen et al., 2020a) as the vocoder to synthesize waveform from the generated mel-spectrogram in all our experiments. Hyperparameters are included in Appendix B. (A sketch of the learning-rate scaling convention appears below the table.) |
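
The LDM-style learning-rate scaling referenced in the Experiment Setup row is simple to reproduce. The sketch below assumes the convention used by latent-diffusion training scripts (effective LR = base LR × number of GPUs × per-GPU batch size); only the base rate of 0.005 is quoted from the paper, and the GPU count and batch size in the usage example are illustrative assumptions rather than values reported in the excerpt.

```python
# Minimal sketch of the LDM-style learning-rate scaling mentioned in the
# Experiment Setup row. Only the base learning rate (0.005) is quoted from
# the paper; num_gpus and batch_size_per_gpu are illustrative assumptions.

def effective_learning_rate(base_lr: float, num_gpus: int, batch_size_per_gpu: int) -> float:
    """Scale the base LR by the number of GPUs and the per-GPU batch size,
    following the convention of latent-diffusion (LDM) training scripts."""
    return base_lr * num_gpus * batch_size_per_gpu


if __name__ == "__main__":
    base_lr = 5e-3            # quoted base learning rate
    num_gpus = 4              # assumption: not specified in the excerpt above
    batch_size_per_gpu = 2    # assumption: not specified in the excerpt above
    lr = effective_learning_rate(base_lr, num_gpus, batch_size_per_gpu)
    print(f"effective learning rate = {lr:.4f}")
```

This is only a sketch of the scaling rule; the actual per-GPU batch size and any additional schedule (warmup, decay) would come from the hyperparameters in the paper's Appendix B.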