Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Authors: Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation. Moreover, we present its controllability and generalization for X-to-Audio with No Modality Left Behind, for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input. Both subjective and objective evaluations demonstrate that Make-An-Audio achieves new state-of-the-art in text-to-audio with natural and controllable synthesis. Make-An-Audio exhibits superior audio quality and text-audio alignment faithfulness on the benchmark AudioCaption dataset and even generalizes well to the unsupervised Clotho dataset in a zero-shot fashion.
Researcher Affiliation | Collaboration | *Equal contribution. 1Zhejiang University, 2Peking University, 3Speech & Audio Team, ByteDance AI Lab. Correspondence to: Zhou Zhao <zhaozhou@zju.edu.cn>.
Pseudocode | No | No pseudocode or algorithm blocks are explicitly presented as figures or labeled sections in the paper.
Open Source Code | No | The paper provides a link for "Audio samples" but not for the source code of the methodology itself: "Audio samples are available at https://Make-An-Audio.github.io".
Open Datasets | Yes | We train on a combination of several datasets: AudioSet, BBC sound effects, Audiostock, AudioCaps-train, ESC-50, FSD50K, Free To Use Sounds, Sonniss Game Effects, WeSoundEffects, MACS, Epidemic Sound, UrbanSound8K, WavText5Ks, LibriSpeech, and Medley-solos-DB. For audios without natural language annotation, we apply the pseudo prompt enhancement to construct captions aligned well with the audio. Overall we have 3k hours with 1M audio-text pairs for training data. (A hypothetical sketch of such caption construction appears after the table.)
Dataset Splits | Yes | For evaluating text-to-audio models (Yang et al., 2022; Kreuk et al., 2022), the AudioCaption validation set is adopted as the standard benchmark, which contains 494 samples with five human-annotated captions in each audio clip. For a more challenging zero-shot scenario, we also provide results on the Clotho (Drossos et al., 2020) validation set, which contains multiple audio events.
Hardware Specification | Yes | For our main experiments, we train a U-Net (Ronneberger et al., 2015) based text-conditional diffusion model, which is optimized using 18 NVIDIA V100 GPUs until 2M optimization steps.
Software Dependencies | No | The paper mentions using HiFi-GAN (Kong et al., 2020) (V1) but does not provide version numbers for other software components like Python, PyTorch, or CUDA.
Experiment Setup | Yes | The base learning rate is set to 0.005, and we scale it by the number of GPUs and the batch size following LDM. We utilize HiFi-GAN (Kong et al., 2020) (V1) trained on the VGGSound dataset (Chen et al., 2020a) as the vocoder to synthesize waveform from the generated mel-spectrogram in all our experiments. Hyperparameters are included in Appendix B. (A sketch of the LDM-style learning-rate scaling appears after the table.)
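
The pseudo prompt enhancement quoted under Open Datasets is only summarized above. The snippet below is a minimal, hypothetical sketch of the general idea of turning class tags into caption-like prompts for audios that lack natural-language annotation; the template strings, tag names, and the make_pseudo_caption helper are illustrative assumptions, not the paper's actual enhancement pipeline.

```python
import random

# Hypothetical sketch: build caption-like text prompts from audio class tags
# (e.g. AudioSet-style labels) so unlabeled audios can be paired with text.
# The templates and helper below are illustrative only, not the paper's
# actual pseudo prompt enhancement pipeline.
TEMPLATES = [
    "the sound of {events}",
    "a recording of {events}",
    "{events} can be heard",
]

def make_pseudo_caption(event_tags):
    """Join class tags into a natural-language-like event description."""
    tags = [t.lower() for t in event_tags]
    events = tags[0] if len(tags) == 1 else ", ".join(tags[:-1]) + " and " + tags[-1]
    return random.choice(TEMPLATES).format(events=events)

if __name__ == "__main__":
    print(make_pseudo_caption(["Dog bark", "Rain"]))  # e.g. "the sound of dog bark and rain"
```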
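
For the Experiment Setup row, the quoted rule of scaling the base learning rate by the number of GPUs and the batch size follows the linear-scaling convention of the LDM codebase. A minimal sketch is below; the per-GPU batch size and gradient-accumulation factor are placeholders (the paper's actual hyperparameters are in its Appendix B), so the printed value is illustrative only.

```python
def scale_learning_rate(base_lr: float, n_gpus: int, batch_size_per_gpu: int,
                        accumulate_grad_batches: int = 1) -> float:
    """LDM-style linear scaling: lr = base_lr * n_gpus * batch_size * grad_accum."""
    return base_lr * n_gpus * batch_size_per_gpu * accumulate_grad_batches

# base_lr = 0.005 and 18 GPUs are the values reported above; the per-GPU batch
# size here is a placeholder, not a value taken from the paper.
effective_lr = scale_learning_rate(base_lr=0.005, n_gpus=18, batch_size_per_gpu=8)
print(f"effective learning rate: {effective_lr}")
```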