SynthNet: Learning to Synthesize Music End-to-End

Authors: Florin Schimbinschi, Christian Walder, Sarah M. Erfani, James Bailey

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare exact replicas of the architectures described in [Van Den Oord et al., 2016; Arik et al., 2017] with our proposed architecture SynthNet. ... For the purpose of validating our hypotheses, we chose to eliminate extra sources of error by manually upsampling the midi files. ... We compare the performance and quality of these two baselines against SynthNet initially in Table 3 over three sets of hyperparameters (Table 2). For the best resulting models we perform MOS listening tests, shown in Table 5.
Researcher Affiliation | Collaboration | Florin Schimbinschi (1), Christian Walder (2), Sarah M. Erfani (1) and James Bailey (1); (1) The University of Melbourne, (2) Data61 CSIRO, Australian National University
Pseudocode | No | No explicit pseudocode or algorithm blocks were found.
Open Source Code | Yes | All are implemented in PyTorch, available at https://github.com/florinsch/synthnet.
Open Datasets | Yes | We generate the dataset using the freely available Timidity++ software synthesizer. ... We used labeled audio recordings of real audio performances from the MusicNet dataset [Thickstun et al., 2016]. (A hedged rendering sketch follows the table.)
Dataset Splits | Yes | After synthesizing the audio, we have approximately 12 minutes of audio for each timbre, of which 9 minutes (75%) is used for training and 3 minutes (25%) for validation. (See the split sketch after the table.)
Hardware Specification | Yes | All training is done on Tesla P100 GPUs with 16GB of memory.
Software Dependencies | No | The paper mentions "All are implemented in PyTorch" but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We use the Adam [Kingma and Ba, 2014] optimizer with a batch size of 1, a learning rate of 10⁻³, β1 = 0.9, β2 = 0.999 and ε = 10⁻⁸, with a weight decay of 10⁻⁵. We find that for most instruments 100-150 epochs is enough for generating high quality audio, however we keep training up to 200 epochs to observe any unexpected behaviour or overfitting. All training is done on Tesla P100 GPUs with 16GB of memory.
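The dataset row above cites Timidity++ for rendering the training audio from MIDI. As a hedged illustration only (the paper excerpt does not state the exact rendering options, sample rate, or file layout), a per-file rendering step could look like the sketch below; all paths and the sample rate are assumptions.

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one directory of MIDI files rendered to WAV, one file per piece.
MIDI_DIR = Path("midi")    # assumption: not specified in the paper excerpt
OUT_DIR = Path("audio")    # assumption
SAMPLE_RATE = 16000        # assumption: typical for WaveNet-style models

OUT_DIR.mkdir(exist_ok=True)
for midi_file in sorted(MIDI_DIR.glob("*.mid")):
    wav_file = OUT_DIR / (midi_file.stem + ".wav")
    # timidity: -Ow writes RIFF WAVE output, -s sets the sampling frequency,
    # -o names the output file. Instrument/timbre selection (e.g. via a
    # soundfont or config file) is omitted because the excerpt does not state it.
    subprocess.run(
        ["timidity", str(midi_file), "-Ow", "-s", str(SAMPLE_RATE), "-o", str(wav_file)],
        check=True,
    )
```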
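The 75%/25% split quoted in the dataset-splits row maps roughly 12 minutes of audio per timbre to about 9 training minutes and 3 validation minutes. A minimal sketch of such a split, assuming the audio is loaded as a single 1-D sample array and that the split is contiguous (the excerpt does not say), might be:

```python
import numpy as np

def split_waveform(samples: np.ndarray, train_fraction: float = 0.75):
    """Split a 1-D array of audio samples into contiguous train/validation parts."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

# Example with 12 minutes of (silent) audio at an assumed 16 kHz sample rate.
sr = 16000  # assumption: the sample rate is not stated in the excerpt
audio = np.zeros(12 * 60 * sr, dtype=np.float32)
train, val = split_waveform(audio)
print(len(train) / sr / 60, len(val) / sr / 60)  # -> 9.0 and 3.0 minutes
```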
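The experiment-setup row fully specifies the optimizer, so the configuration can be written down directly. The sketch below mirrors the quoted hyperparameters in PyTorch; the model, dataset, and loss are placeholders rather than the authors' SynthNet implementation, which lives in their repository.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data standing in for SynthNet and its audio dataset;
# the real architecture and loss are in https://github.com/florinsch/synthnet.
model = torch.nn.Linear(128, 256)
dataset = TensorDataset(torch.randn(64, 128), torch.randn(64, 256))
loader = DataLoader(dataset, batch_size=1)  # batch size of 1, as quoted

# Optimizer as described: Adam with lr=1e-3, betas=(0.9, 0.999),
# eps=1e-8 and weight decay 1e-5.
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-5
)

# Up to 200 epochs (the paper notes 100-150 usually suffice per instrument).
for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)  # placeholder loss, not the paper's
        loss.backward()
        optimizer.step()
```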