Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Authors: Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. |
| Researcher Affiliation | Industry | Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel & Douglas Eck — Google Brain, DeepMind {fjord,astas,adarob,iansimon,annahuang,sedielem,eriche,jesseengel,deck}@google.com |
| Pseudocode | No | The paper describes methods and architectures but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is open-source or provide a link to a code repository. It provides links for the dataset and audio examples. |
| Open Datasets | Yes | We make the new dataset (MIDI, audio, metadata, and train/validation/test split configuration) available at https://g.co/magenta/maestro-dataset under a Creative Commons Attribution Non-Commercial Share-Alike 4.0 license. |
| Dataset Splits | Yes | A train/validation/test split configuration is also proposed, so that the same composition, even if performed by multiple contestants, does not appear in multiple subsets. These proportions should be true globally and also within each composer. Maintaining these proportions is not always possible because some composers have too few compositions in the dataset. The validation and test splits should contain a variety of compositions. Extremely popular compositions performed by many performers should be placed in the training split. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory, or specific computing environments) used for running experiments were provided. |
| Software Dependencies | No | The paper mentions software like 'FluidSynth', 'librosa', and 'pretty_midi' but does not provide specific version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | We trained on random crops of 2048 events and employed transposition and time compression/stretching data augmentation. The transpositions were uniformly sampled in the range of a minor third below and above the original piece. The time stretches were at discrete amounts and uniformly sampled from the set {0.95, 0.975, 1.0, 1.025, 1.05}. Our WaveNet model uses a similar autoregressive architecture to van den Oord et al. (2016), but with a larger receptive field: 6 (instead of 3) sequential stacks with 10 residual block layers each. We found that a deeper context stack, namely 2 stacks with 6 layers each arranged in a series, worked better for this task. We also updated the model to produce 16-bit output using a mixture of logistics as described in van den Oord et al. (2018). The resulting losses after 1M training steps were 3.72, 3.70 and 3.84, respectively. |
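
The split rules quoted in the Dataset Splits row can be illustrated with a minimal sketch. The grouping keys (`composer`, `composition`, `duration`), the 80/10/10 target proportions, and the popularity cutoff are illustrative assumptions rather than the authors' released tooling, and the sketch only balances proportions globally; the paper additionally aims to maintain them within each composer where possible.

```python
from collections import defaultdict

# Assumed target proportions; the paper asks that they hold globally
# and, where possible, within each composer.
TARGET = {"train": 0.8, "validation": 0.1, "test": 0.1}

def split_compositions(performances, popular_threshold=4):
    """Assign every performance of a composition to exactly one split."""
    # Group performances by (composer, composition) so the same piece never
    # appears in more than one subset, even when several performers played it.
    groups = defaultdict(list)
    for perf in performances:
        groups[(perf["composer"], perf["composition"])].append(perf)

    totals = {name: 0.0 for name in TARGET}
    assignment = {}
    # Handle the most-performed pieces first: extremely popular compositions
    # go straight to the training split, as the quoted rule requires.
    for key, perfs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        duration = sum(p["duration"] for p in perfs)
        if len(perfs) >= popular_threshold:  # "extremely popular": assumed cutoff
            name = "train"
        else:
            # Greedily pick the split currently furthest below its target share.
            grand_total = sum(totals.values()) + duration
            name = min(TARGET, key=lambda n: totals[n] / grand_total - TARGET[n])
        assignment[key] = name
        totals[name] += duration
    return assignment
```

On the real metadata one would also check the resulting per-composer proportions and iterate, which the paper notes is not always possible for composers with few compositions.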
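The data augmentation described in the Experiment Setup row (random 2048-event crops, transposition within a minor third, discrete time stretching) can be sketched as follows. The `(pitch, onset, offset)` note tuples and the `augment` helper are assumptions for illustration; the paper's models operate on a performance-event vocabulary rather than note tuples.

```python
import random

CROP_LEN = 2048
STRETCH_FACTORS = (0.95, 0.975, 1.0, 1.025, 1.05)

def augment(events, rng=random):
    """Apply the crop / transpose / time-stretch augmentation to one piece."""
    # Random crop of at most 2048 events.
    if len(events) > CROP_LEN:
        start = rng.randrange(len(events) - CROP_LEN + 1)
        events = events[start:start + CROP_LEN]

    # Transposition uniformly sampled between a minor third (3 semitones)
    # below and above the original piece.
    semitones = rng.randint(-3, 3)

    # Time stretch uniformly sampled from the discrete set of factors.
    stretch = rng.choice(STRETCH_FACTORS)

    return [
        (pitch + semitones, onset * stretch, offset * stretch)
        for pitch, onset, offset in events
    ]
```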
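The claim of a larger receptive field (6 sequential stacks of 10 residual block layers instead of 3) can be sanity-checked with a quick calculation, assuming kernel size 2 and dilations doubling from 1 to 512 within each stack as in the original WaveNet; the 16 kHz sample rate used for the conversion to seconds is an assumption for illustration.

```python
def receptive_field(stacks, layers_per_stack, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    # Each dilated conv layer with dilation d widens the field by (kernel_size - 1) * d.
    per_stack = sum((kernel_size - 1) * 2 ** i for i in range(layers_per_stack))
    return stacks * per_stack + 1  # +1 for the current sample

if __name__ == "__main__":
    for stacks in (3, 6):
        samples = receptive_field(stacks, layers_per_stack=10)
        print(f"{stacks} stacks -> {samples} samples "
              f"(~{samples / 16000:.3f} s at an assumed 16 kHz)")
```

Doubling the number of stacks from 3 to 6 roughly doubles the receptive field (from about 3,070 to about 6,139 samples under these assumptions), which is the sense in which the quoted setup describes a "larger receptive field."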