reproducibilityindex.ai

Creative Text-to-Audio Generation via Synthesizer Programming

Authors: Manuel Cherep, Nikhil Singh, Jessica Shand

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results show how CTAG produces sounds that are distinctive, perceived as artistic, and yet similarly identifiable to recent neural audio synthesis models, positioning it as a valuable and complementary tool. Extensive experiments evaluating different approaches to solving this problem, varying optimization algorithms, sound durations, and synthesis architectures.
Researcher Affiliation	Academia	Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA.
Pseudocode	Yes	Algorithm 1 Our optimization procedure for producing sounds in CTAG.
Open Source Code	Yes	We will open-source our approach, both to provide a tool for novices and experts alike to realize their ideas, as well as to provoke future audio generation paradigms that recognize abstraction as an important factor for creative expression. (Footnote 1: ctag.media.mit.edu)
Open Datasets	Yes	We evaluate on two well-known datasets. The first is ESC-50, a 50-class canonical environmental sound classification dataset (Piczak, 2015). The second is a subset of Audio Set (Gemmeke et al., 2017);
Dataset Splits	Yes	We tuned for 50 trials on the ESC-10 dataset, a subset of ESC-50 (Piczak, 2015).
Hardware Specification	Yes	In Table 4 we illustrate the optimization times, in seconds, for different numbers of iterations (rows) and optimizer population sizes (columns) below, on a modest GPU, i.e. single V100.
Software Dependencies	No	The paper mentions software components like SynthAX, JAX, Evosax, and LAION-CLAP, along with their respective citations, but does not provide specific version numbers for these software dependencies (e.g., 'Evosax version X.Y.Z').
Experiment Setup	Yes	We specifically use the Voice synthesizer architecture... It consists of 78 parameters... All parameters are initialized uniformly, θ_i ~ U(0, 1). In all our experiments, the synthesizer has a control rate of 480 Hz and the audio is generated in batches at a sample rate of 48 kHz. We experimented with several non-gradient optimization algorithms... For each algorithm, we first tuned hyperparameters using Bayesian optimization... The optimization procedure is specified in Algorithm 1. ...LES optimizer, 2-second sounds, and the Voice architecture. We conducted a full hyperparameter tuning run with 50 trials of all ESC-50 prompts to obtain the final optimization hyperparameters.
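
The numbers quoted in the Experiment Setup row can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' implementation: it uses only the standard library rather than SynthAX, JAX, or Evosax, `toy_score` is a hypothetical stand-in for the text-audio similarity objective, and the simple elitist search in `optimize` is a stand-in for a gradient-free optimizer, not LES. Only the constants (78 parameters, 480 Hz control rate, 48 kHz sample rate, 2-second sounds) and the U(0, 1) initialization come from the excerpt; the iteration count, population size, and noise scale are assumed values.

```python
import random

# Values quoted in the excerpt above.
NUM_PARAMS = 78           # parameters of the Voice architecture
SAMPLE_RATE_HZ = 48_000   # audio sample rate
CONTROL_RATE_HZ = 480     # synthesizer control rate
DURATION_S = 2            # 2-second sounds

# Derived quantities (plain arithmetic on the values above).
SAMPLES_PER_CONTROL_STEP = SAMPLE_RATE_HZ // CONTROL_RATE_HZ  # audio samples per control frame
NUM_SAMPLES = SAMPLE_RATE_HZ * DURATION_S                     # audio samples per sound
NUM_CONTROL_STEPS = CONTROL_RATE_HZ * DURATION_S              # control frames per sound


def init_params(rng, n=NUM_PARAMS):
    """Draw each theta_i independently from U(0, 1)."""
    return [rng.random() for _ in range(n)]


def toy_score(theta, target):
    """Hypothetical objective (higher is better); stands in for a similarity score."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def clip01(x):
    """Keep a parameter inside the unit interval."""
    return min(1.0, max(0.0, x))


def optimize(rng, target, iterations=50, popsize=16, sigma=0.1):
    """Elitist gradient-free search: perturb the parent, keep the best candidate
    of each generation only if it improves on the parent. Not the LES optimizer."""
    parent = init_params(rng, len(target))
    parent_score = toy_score(parent, target)
    for _ in range(iterations):
        population = [
            [clip01(t + rng.gauss(0.0, sigma)) for t in parent]
            for _ in range(popsize)
        ]
        best = max(population, key=lambda cand: toy_score(cand, target))
        best_score = toy_score(best, target)
        if best_score > parent_score:
            parent, parent_score = best, best_score
    return parent, parent_score


# Assumed toy target: every parameter at the midpoint of its range.
target = [0.5] * NUM_PARAMS
theta0 = init_params(random.Random(0))            # the initial draw, for comparison
theta, score = optimize(random.Random(0), target)
```

Because both calls seed the generator identically, `theta0` is the same vector that `optimize` starts from, so `score` can be compared directly against `toy_score(theta0, target)` to see how far the search moved from its uniform initialization.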