AudioGen: Textually Guided Audio Generation

Authors: Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we propose AUDIOGEN, an auto-regressive generative model that generates audio samples conditioned on text inputs. AUDIOGEN operates on a learnt discrete audio representation. ... We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Comparing to the evaluated baselines, AUDIOGEN outperforms over both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen. ... 4 EXPERIMENTS (a minimal sketch of the classifier-free guidance step appears after the table)
Researcher Affiliation | Collaboration | Felix Kreuk1, Gabriel Synnaeve1, Adam Polyak1, Uriel Singer1, Alexandre Défossez1, Jade Copet1, Devi Parikh1, Yaniv Taigman1, Yossi Adi1,2; 1FAIR Team, Meta AI; 2The Hebrew University of Jerusalem; felixkreuk@meta.com
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link (https://felixkreuk.github.io/audiogen) for 'Samples' but does not explicitly state that the source code for their method is available at this link or elsewhere.
Open Datasets | Yes | Dataset. We use a set of several datasets: Audio Set (Gemmeke et al., 2017), BBC sound effects, Audio Caps (Kim et al., 2019), Clotho v2 (Drossos et al., 2020), VGG-Sound (Chen et al., 2020), FSD50K (Fonseca et al., 2021), Free To Use Sounds, Sonniss Game Effects, We Sound Effects, Paramount Motion Odeon Cinematic Sound Effects.
Dataset Splits | No | The paper mentions '4k hours for training data' and evaluating on the 'Audio Caps test set', but does not explicitly state the dataset splits for training, validation, and testing with specific percentages or counts.
Hardware Specification | Yes | The small model was trained on 64 A100 GPUs for 200k steps (~5 days) and the large model was trained on 128 A100 GPUs for 200k steps (~1 week).
Software Dependencies | No | The paper mentions using a 'T5 text-encoder' and 'NLTK' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We trained two sets of ALMs, one with 285M parameters (base) and the other with 1B parameters (large). In the smaller model we use a hidden-size of 768, 24 layers and 16 attention-heads, while for the large variant we use a hidden-size of 1280, 36 layers and 20 attention-heads. We use the Adam optimizer with a batch size of 256, a learning rate of 5e-4 and 3k steps of warm-up followed by inverse-square-root decay. ... For sampling, we employ top-p (Holtzman et al., 2019) sampling with p = 0.25. For the CFG we use γ = 3.0. (sketches of the learning-rate schedule and top-p sampling appear after the table)
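
The classifier-free guidance referenced above (applied at inference with γ = 3.0) follows the standard logit-space formulation. Below is a minimal sketch of that formulation, assuming a hypothetical `lm(tokens, cond)` interface that returns next-token logits over the discrete audio codes; it is an illustration, not the authors' released implementation.

```python
def cfg_logits(lm, audio_tokens, text_emb, null_emb, guidance_scale=3.0):
    """Classifier-free guidance for autoregressive audio-token decoding.

    `lm(tokens, cond)` is a hypothetical interface returning next-token
    logits; `text_emb` is the text conditioning (e.g. a T5 encoder output)
    and `null_emb` an unconditional (empty-text) embedding.
    `guidance_scale` plays the role of the paper's gamma.
    """
    cond_logits = lm(audio_tokens, text_emb)    # conditional forward pass
    uncond_logits = lm(audio_tokens, null_emb)  # unconditional forward pass
    # Guided logits: uncond + gamma * (cond - uncond); with gamma > 1 this
    # pushes sampling toward tokens favored by the text condition.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```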
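The optimization and sampling hyperparameters quoted in the 'Experiment Setup' row (3k warm-up steps followed by inverse-square-root decay of the 5e-4 learning rate, and top-p sampling with p = 0.25) can be sketched as follows. Both functions use common formulations with assumed details (e.g. linear warm-up) and may differ from the paper's exact implementation.

```python
import torch

def inv_sqrt_lr(step, base_lr=5e-4, warmup_steps=3000):
    """Learning rate at a given step: linear warm-up to `base_lr`, then
    inverse-square-root decay (assumed formulation)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (warmup_steps / step) ** 0.5

def top_p_sample(logits, p=0.25):
    """Nucleus (top-p) sampling over next-token logits of shape [vocab]
    or [batch, vocab]."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens once the cumulative mass *before* them already reaches p,
    # keeping the smallest nucleus that covers p (and at least the top token).
    sorted_probs[cumulative - sorted_probs >= p] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)
```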