SynthNet: Learning to Synthesize Music End-to-End
Authors: Florin Schimbinschi, Christian Walder, Sarah M. Erfani, James Bailey
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare exact replicas of the architectures described in [Van Den Oord et al., 2016; Arik et al., 2017] with our proposed architecture SynthNet. ... For the purpose of validating our hypotheses, we chose to eliminate extra sources of error by manually upsampling the MIDI files. ... We compare the performance and quality of these two baselines against SynthNet initially in Table 3 over three sets of hyperparameters (Table 2). For the best resulting models we perform MOS listening tests, shown in Table 5. (A sketch of the MIDI upsampling step appears after this table.) |
| Researcher Affiliation | Collaboration | Florin Schimbinschi¹, Christian Walder², Sarah M. Erfani¹ and James Bailey¹; ¹The University of Melbourne; ²Data61 CSIRO, Australian National University |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | All are implemented in PyTorch, available at https://github.com/florinsch/synthnet. |
| Open Datasets | Yes | We generate the dataset using the freely available Timidity++ software synthesizer. ... We used labeled audio recordings of real audio performances from the MusicNet dataset [Thickstun et al., 2016]. (A batch-rendering sketch follows the table.) |
| Dataset Splits | Yes | After synthesizing the audio, we have approximately 12 minutes of audio for each timbre, of which 9 minutes (75%) is used for training and 3 minutes (25%) for validation. (See the split sketch after this table.) |
| Hardware Specification | Yes | All training is done on Tesla P100 GPUs with 16GB of memory. |
| Software Dependencies | No | The paper mentions 'All are implemented in PyTorch' but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | We use the Adam [Kingma and Ba, 2014] optimizer with a batch size of 1, a learning rate of 10⁻³, β₁ = 0.9, β₂ = 0.999 and ε = 10⁻⁸ with a weight decay of 10⁻⁵. We find that for most instruments 100-150 epochs is enough for generating high quality audio, however we keep training up to 200 epochs to observe any unexpected behaviour or overfitting. All training is done on Tesla P100 GPUs with 16GB of memory. (The optimizer configuration is sketched after this table.) |
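
The Research Type row quotes the paper's choice to manually upsample the MIDI files so the conditioning aligns with the raw audio. A minimal sketch of one way to do this, assuming a piano-roll representation; the frame rate, sample rate, and example roll are illustrative values, not taken from the paper:

```python
import numpy as np

def upsample_pianoroll(roll: np.ndarray, frame_rate: int, sample_rate: int) -> np.ndarray:
    """Repeat each MIDI frame so the roll matches the audio sample rate."""
    factor = sample_rate // frame_rate          # samples per conditioning frame
    return np.repeat(roll, factor, axis=0)      # (frames * factor, 128)

roll = np.zeros((100, 128), dtype=np.float32)   # 100 frames over 128 MIDI pitches
roll[10:50, 60] = 1.0                           # middle C held for 40 frames
upsampled = upsample_pianoroll(roll, frame_rate=100, sample_rate=16000)
print(upsampled.shape)                          # (16000, 128)
```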
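
The Open Datasets row states that the audio was generated with the freely available Timidity++ synthesizer. A hedged sketch of batch-rendering MIDI files to WAV through the Timidity++ command line; the directory layout is hypothetical, and only the standard `-Ow` (RIFF WAVE output) and `-o` (output file) flags are used, which may differ from the options in the actual pipeline:

```python
import subprocess
from pathlib import Path

midi_dir, wav_dir = Path("midi"), Path("audio")
wav_dir.mkdir(exist_ok=True)

for mid in sorted(midi_dir.glob("*.mid")):
    out = wav_dir / mid.with_suffix(".wav").name
    # -Ow selects RIFF WAVE output; -o names the output file.
    subprocess.run(["timidity", str(mid), "-Ow", "-o", str(out)], check=True)
```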
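
The Dataset Splits row reports roughly 12 minutes of audio per timbre, split 9/3 minutes (75%/25%) into training and validation. A minimal sketch of that split on a mono waveform; the 16 kHz sample rate is an assumption, not stated in the table:

```python
import numpy as np

sample_rate = 16000                       # assumed; not reported in the row above
audio = np.zeros(12 * 60 * sample_rate)   # ~12 minutes of audio for one timbre

split = int(0.75 * len(audio))            # first 9 minutes for training
train, valid = audio[:split], audio[split:]
print(len(train) / sample_rate / 60, len(valid) / sample_rate / 60)  # 9.0 3.0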
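
The Experiment Setup row fully specifies the Adam configuration. A minimal PyTorch sketch wiring those exact values to a placeholder model; the model is illustrative, not the SynthNet architecture:

```python
import torch

model = torch.nn.Linear(128, 256)  # placeholder, stands in for the real network

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                 # learning rate 10^-3
    betas=(0.9, 0.999),      # β1, β2 as reported
    eps=1e-8,                # ε = 10^-8
    weight_decay=1e-5,       # weight decay 10^-5
)
```

With a batch size of 1, as quoted, each optimizer step sees a single training example, which matches the per-recording training regime the paper describes.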