SING: Symbol-to-Instrument Neural Generator

Authors: Alexandre Défossez, Neil Zeghidour, Nicolas Usunier, Léon Bottou, Francis Bach

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we study the more computationally efficient alternative of generating the waveform frame-by-frame with large strides. We present SING, a lightweight neural audio synthesizer for the original task of generating musical notes given desired instrument, pitch and velocity. Our model is trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms. On the generalization task of synthesizing notes for pairs of pitch and instrument not seen during training, SING produces audio with significantly improved perceptual quality compared to a state-of-the-art autoencoder based on WaveNet [4] as measured by a Mean Opinion Score (MOS), and is about 32 times faster for training and 2,500 times faster for inference. (A sketch of such a log-spectrogram loss appears after the table.)
Researcher Affiliation | Collaboration | Alexandre Défossez (Facebook AI Research; INRIA / ENS, PSL Research University; Paris, France; defossez@fb.com), Neil Zeghidour (Facebook AI Research; LSCP / ENS / EHESS / CNRS; INRIA / PSL Research University; Paris, France; neilz@fb.com), Nicolas Usunier (Facebook AI Research; Paris, France; usunier@fb.com), Léon Bottou (Facebook AI Research; New York, USA; leonb@fb.com), Francis Bach (INRIA; École Normale Supérieure; PSL Research University; francis.bach@ens.fr)
Pseudocode | No | The paper provides a diagram of the architecture (Figure 1) but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code for SING and a pretrained model are available on our GitHub: https://github.com/facebookresearch/SING. Audio samples are available on the article webpage.
Open Datasets | Yes | The train set from the NSynth dataset [4] is composed of 289,205 audio recordings of instruments, some synthetic and some acoustic.
Dataset Splits | No | We did not make use of the validation or test set from the original NSynth dataset because the instruments had no overlap with the training set. Because we use a look-up table for the instrument embedding, we cannot generate audio for unseen instruments. Instead, we selected for each instrument 10% of the pitches randomly that we moved to a separate test set. (A sketch of this per-instrument pitch split appears after the table.)
Hardware Specification | Yes | All the models are trained on 4 P100 GPUs using Adam [15] with a learning rate of 0.0003 and a batch size of 256.
Software Dependencies | No | Our approach trains and generates waveforms comparably fast with a PyTorch implementation. The paper mentions PyTorch and the Adam optimizer but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | All the models are trained on 4 P100 GPUs using Adam [15] with a learning rate of 0.0003 and a batch size of 256. We train the auto-encoder for 50 epochs which takes about 12 hours on 4 GPUs. The LSTM is trained for 50 epochs using truncated backpropagation through time [26] using a sequence length of 32. This takes about 10 hours on 4 GPUs. We do so [i.e., fine-tune the complete model end-to-end] for 20 epochs which takes about 8 hours on 4 GPUs. (A sketch of this training setup appears after the table.)
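
The Research Type row quotes the paper's central idea of a loss that minimizes distances between the log spectrograms of the generated and target waveforms. Below is a minimal PyTorch sketch of such a loss; the FFT size, hop length, epsilon, and the choice of an L1 distance are illustrative assumptions, not the paper's exact settings.

```python
import torch

def log_spectrogram_loss(x_hat, x, n_fft=1024, hop_length=256, eps=1.0):
    """Distance between log power spectrograms of generated (x_hat) and
    target (x) waveforms of shape (batch, samples). The FFT size, hop
    length, eps, and the L1 reduction are assumptions for illustration."""
    window = torch.hann_window(n_fft, device=x.device)

    def log_spec(wav):
        spec = torch.stft(wav, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True)
        return torch.log(eps + spec.abs() ** 2)

    return (log_spec(x_hat) - log_spec(x)).abs().mean()


# Usage on random batches the size of NSynth notes (4 seconds at 16 kHz):
x = torch.randn(8, 64000)
x_hat = torch.randn(8, 64000)
print(log_spectrogram_loss(x_hat, x))
```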
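The Dataset Splits row describes holding out, for each instrument, a random 10% of its pitches as a separate test set. The sketch below performs that split over the NSynth metadata; the examples.json path, the "instrument" and "pitch" field names, and the random seed are assumptions based on the public NSynth release, not details given in the paper.

```python
import json
import random
from collections import defaultdict

random.seed(0)  # arbitrary seed; the paper does not specify one

# examples.json ships with the public NSynth release and maps note ids to
# metadata dictionaries; the field names below are assumptions based on it.
with open("nsynth-train/examples.json") as f:
    meta = json.load(f)

pitches_by_instrument = defaultdict(set)
for note in meta.values():
    pitches_by_instrument[note["instrument"]].add(note["pitch"])

# For every instrument, move a random 10% of its pitches to the test set.
held_out = set()
for instrument, pitches in pitches_by_instrument.items():
    pitches = sorted(pitches)
    k = max(1, len(pitches) // 10)
    held_out.update((instrument, p) for p in random.sample(pitches, k))

train_ids = [i for i, n in meta.items()
             if (n["instrument"], n["pitch"]) not in held_out]
test_ids = [i for i, n in meta.items()
            if (n["instrument"], n["pitch"]) in held_out]
```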
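The Hardware Specification and Experiment Setup rows report Adam with a learning rate of 0.0003, a batch size of 256 on 4 P100 GPUs, and truncated backpropagation through time with a sequence length of 32 for the LSTM stage. The sketch below reproduces that configuration around a placeholder LSTM; the layer sizes, the dummy data, and the single-device training loop are illustrative assumptions, not the paper's architecture or multi-GPU setup.

```python
import torch
from torch import nn

# Placeholder sequence model standing in for SING's LSTM stage; the layer
# sizes and input features are illustrative. The paper trains on 4 P100
# GPUs; multi-GPU distribution is omitted from this sketch.
device = "cuda" if torch.cuda.is_available() else "cpu"
lstm = nn.LSTM(input_size=256, hidden_size=1024, num_layers=3,
               batch_first=True).to(device)

# Adam with learning rate 0.0003 and batch size 256, as reported.
optimizer = torch.optim.Adam(lstm.parameters(), lr=3e-4)
frames = torch.randn(256, 256, 256, device=device)  # (batch, time, features) dummy data

# Truncated backpropagation through time with a sequence length of 32:
# backpropagate within each 32-step chunk only, and detach the recurrent
# state so gradients do not flow across chunk boundaries.
hidden = None
for start in range(0, frames.size(1), 32):
    chunk = frames[:, start:start + 32]
    output, hidden = lstm(chunk, hidden)
    loss = output.pow(2).mean()              # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    hidden = tuple(h.detach() for h in hidden)
```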