Creative Text-to-Audio Generation via Synthesizer Programming

Authors: Manuel Cherep, Nikhil Singh, Jessica Shand

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show how CTAG produces sounds that are distinctive, perceived as artistic, and yet similarly identifiable to recent neural audio synthesis models, positioning it as a valuable and complementary tool. Extensive experiments evaluating different approaches to solving this problem, varying optimization algorithms, sound durations, and synthesis architectures.
Researcher Affiliation | Academia | Media Lab, Massachusetts Institute of Technology, Cambridge MA, USA.
Pseudocode | Yes | Algorithm 1 Our optimization procedure for producing sounds in CTAG.
Open Source Code | Yes | We will open-source our approach, both to provide a tool for novices and experts alike to realize their ideas, as well as to provoke future audio generation paradigms that recognize abstraction as an important factor for creative expression. Project page: ctag.media.mit.edu
Open Datasets | Yes | We evaluate on two well-known datasets. The first is ESC-50, a 50-class canonical environmental sound classification dataset (Piczak, 2015). The second is a subset of AudioSet (Gemmeke et al., 2017).
Dataset Splits | Yes | We tuned for 50 trials on the ESC-10 dataset, a subset of ESC-50 (Piczak, 2015).
Hardware Specification | Yes | In Table 4 we illustrate the optimization times, in seconds, for different numbers of iterations (rows) and optimizer population sizes (columns) below, on a modest GPU, i.e. single V100.
Software Dependencies | No | The paper mentions software components like SYNTHAX, JAX, Evosax, and LAION-CLAP, along with their respective citations, but does not provide specific version numbers for these software dependencies (e.g., 'Evosax version X.Y.Z').
Experiment Setup | Yes | We specifically use the Voice synthesizer architecture... It consists of 78 parameters... All parameters are initialized uniformly, θ_i ∼ U(0, 1). In all our experiments, the synthesizer has a control rate of 480 Hz and the audio is generated in batches at a sample rate of 48 kHz. We experimented with several non-gradient optimization algorithms... For each algorithm, we first tuned hyperparameters using Bayesian optimization... The optimization procedure is specified in Algorithm 1. ...LES optimizer, 2-second sounds, and the Voice architecture. We conducted a full hyperparameter tuning run with 50 trials of all ESC-50 prompts to obtain the final optimization hyperparameters.
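
The Experiment Setup row can be illustrated with a minimal Python sketch, under stated assumptions. The constants (78 parameters, 480 Hz control rate, 48 kHz sample rate, 2-second sounds, uniform initialization) are taken from the row above. Everything else is hypothetical: `toy_score` is a stand-in for the text-audio similarity objective, and the loop is a plain (1+λ) evolution strategy, not the LES optimizer and not Algorithm 1 from the paper. The function names and the `iterations`, `popsize`, and `sigma` values are illustrative choices only.

```python
import random

NUM_PARAMS = 78        # parameter count reported for the Voice architecture
SAMPLE_RATE = 48_000   # audio sample rate in Hz (reported)
CONTROL_RATE = 480     # synthesizer control rate in Hz (reported)
DURATION_S = 2         # sound duration in seconds (reported setup)


def samples_per_control_step(sample_rate=SAMPLE_RATE, control_rate=CONTROL_RATE):
    """Number of audio samples spanned by one control-rate frame."""
    return sample_rate // control_rate


def init_params(rng, n=NUM_PARAMS):
    """Draw each parameter independently from U(0, 1)."""
    return [rng.random() for _ in range(n)]


def toy_score(params, target):
    """Hypothetical objective (higher is better): negative squared distance
    to a target vector. The paper's objective compares audio to a text
    prompt; this stand-in only keeps the sketch self-contained."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def clip01(x):
    """Keep a parameter inside the unit interval."""
    return min(1.0, max(0.0, x))


def optimize(score_fn, rng, iterations=50, popsize=16, sigma=0.1):
    """Gradient-free (1+lambda) evolution strategy. An assumption for
    illustration, swapped in for the tuned optimizers the paper evaluates."""
    best = init_params(rng)
    best_score = score_fn(best)
    for _ in range(iterations):
        # Sample a population of Gaussian perturbations around the incumbent.
        population = [
            [clip01(p + rng.gauss(0.0, sigma)) for p in best]
            for _ in range(popsize)
        ]
        # Evaluate every candidate and keep the best if it improves.
        scored = [(score_fn(c), c) for c in population]
        top_score, top = max(scored, key=lambda sc: sc[0])
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score
```

At the reported rates, each control frame spans 48,000 / 480 = 100 audio samples, so a 2-second sound has 96,000 samples and 960 control frames. Because the loop only accepts improving candidates, the final score can never be lower than the score of the uniform initialization.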