Creative Text-to-Audio Generation via Synthesizer Programming
Authors: Manuel Cherep, Nikhil Singh, Jessica Shand
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show how CTAG produces sounds that are distinctive, perceived as artistic, and yet similarly identifiable to recent neural audio synthesis models, positioning it as a valuable and complementary tool. ...Extensive experiments evaluating different approaches to solving this problem, varying optimization algorithms, sound durations, and synthesis architectures. |
| Researcher Affiliation | Academia | Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA. |
| Pseudocode | Yes | Algorithm 1 Our optimization procedure for producing sounds in CTAG. |
| Open Source Code | Yes | We will open-source our approach, both to provide a tool for novices and experts alike to realize their ideas, as well as to provoke future audio generation paradigms that recognize abstraction as an important factor for creative expression. Project page: ctag.media.mit.edu |
| Open Datasets | Yes | We evaluate on two well-known datasets. The first is ESC-50, a 50-class canonical environmental sound classification dataset (Piczak, 2015). The second is a subset of AudioSet (Gemmeke et al., 2017); |
| Dataset Splits | Yes | We tuned for 50 trials on the ESC-10 dataset, a subset of ESC-50 (Piczak, 2015). |
| Hardware Specification | Yes | In Table 4 we illustrate the optimization times, in seconds, for different numbers of iterations (rows) and optimizer population sizes (columns) below, on a modest GPU, i.e. single V100. |
| Software Dependencies | No | The paper mentions software components like SynthAX, JAX, evosax, and LAION-CLAP, along with their respective citations, but does not provide specific version numbers for these software dependencies (e.g., 'evosax version X.Y.Z'). |
| Experiment Setup | Yes | We specifically use the Voice synthesizer architecture... It consists of 78 parameters... All parameters are initialized uniformly, θ_i ∼ U(0, 1). In all our experiments, the synthesizer has a control rate of 480 Hz and the audio is generated in batches at a sample rate of 48 kHz. We experimented with several non-gradient optimization algorithms... For each algorithm, we first tuned hyperparameters using Bayesian optimization... The optimization procedure is specified in Algorithm 1. ...LES optimizer, 2-second sounds, and the Voice architecture. We conducted a full hyperparameter tuning run with 50 trials of all ESC-50 prompts to obtain the final optimization hyperparameters. |
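
The setup quoted in the Experiment Setup row can be illustrated with a small sketch. It is not the paper's Algorithm 1 and does not use the LES optimizer, SynthAX, or LAION-CLAP: the optimizer below is a plain elite-averaging evolution strategy chosen only as a simple example of gradient-free search, and `toy_score` is a hypothetical stand-in for the text-audio similarity objective. The function names, population size, iteration count, and `sigma` are all assumptions. Only the 78-parameter count, the U(0, 1) initialization, the 480 Hz control rate, the 48 kHz sample rate, and the 2-second duration come from the quoted text.

```python
import random

NUM_PARAMS = 78           # parameter count of the Voice architecture (from the quoted setup)
SAMPLE_RATE_HZ = 48_000   # audio sample rate (from the quoted setup)
CONTROL_RATE_HZ = 480     # synthesizer control rate (from the quoted setup)
DURATION_S = 2            # sound duration in the final configuration


def frame_counts(duration_s=DURATION_S):
    """Return (audio samples, control frames, audio samples per control frame)."""
    audio_samples = SAMPLE_RATE_HZ * duration_s
    control_frames = CONTROL_RATE_HZ * duration_s
    return audio_samples, control_frames, SAMPLE_RATE_HZ // CONTROL_RATE_HZ


def toy_score(params, target):
    """Hypothetical objective (higher is better); a placeholder, not CLAP similarity."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def simple_es(score_fn, num_params=NUM_PARAMS, popsize=16,
              iterations=50, sigma=0.1, seed=0):
    """Generic gradient-free search over parameters constrained to [0, 1].

    An assumed, simplified evolution strategy: sample around a mean,
    keep the top quarter, and move the mean to the elite average.
    """
    rng = random.Random(seed)
    # Uniform initialization, theta_i ~ U(0, 1), as in the quoted setup.
    mean = [rng.uniform(0.0, 1.0) for _ in range(num_params)]
    best_params, best_score = list(mean), score_fn(mean)
    for _ in range(iterations):
        # Sample a population around the mean, clipped to the valid range.
        population = [
            [min(1.0, max(0.0, m + rng.gauss(0.0, sigma))) for m in mean]
            for _ in range(popsize)
        ]
        ranked = sorted(population, key=score_fn, reverse=True)
        elites = ranked[: max(1, popsize // 4)]
        mean = [sum(column) / len(elites) for column in zip(*elites)]
        top_score = score_fn(ranked[0])
        if top_score > best_score:
            best_params, best_score = ranked[0], top_score
    return best_params, best_score


if __name__ == "__main__":
    print(frame_counts())  # (96000, 960, 100)
    target = [0.5] * NUM_PARAMS
    params, score = simple_es(lambda p: toy_score(p, target))
    print(len(params), score)
```

At the quoted rates, a 2-second sound corresponds to 96,000 audio samples and 960 control frames, or 100 audio samples per control frame. In a real pipeline the scoring function would render audio from the parameters and compare it with the text prompt; that step is deliberately left out here.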