Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset
Authors: Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. |
| Researcher Affiliation | Industry | Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel & Douglas Eck — Google Brain, DeepMind {fjord,astas,adarob,iansimon,annahuang,sedielem,eriche,jesseengel,deck}@google.com |
| Pseudocode | No | The paper describes methods and architectures but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for their methodology is open-source or provide a link to a code repository. It provides links for the dataset and audio examples. |
| Open Datasets | Yes | We make the new dataset (MIDI, audio, metadata, and train/validation/test split configuration) available at https://g.co/magenta/maestro-dataset under a Creative Commons Attribution Non-Commercial Share-Alike 4.0 license. |
| Dataset Splits | Yes | A train/validation/test split configuration is also proposed, so that the same composition, even if performed by multiple contestants, does not appear in multiple subsets. These proportions should be true globally and also within each composer. Maintaining these proportions is not always possible because some composers have too few compositions in the dataset. The validation and test splits should contain a variety of compositions. Extremely popular compositions performed by many performers should be placed in the training split. |
| Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory, or specific computing environments) used for running experiments were provided. |
| Software Dependencies | No | The paper mentions software like 'FluidSynth', 'librosa', and 'pretty_midi' but does not provide specific version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | We trained on random crops of 2048 events and employed transposition and time compression/stretching data augmentation. The transpositions were uniformly sampled in the range of a minor third below and above the original piece. The time stretches were at discrete amounts and uniformly sampled from the set {0.95, 0.975, 1.0, 1.025, 1.05}. Our WaveNet model uses a similar autoregressive architecture to van den Oord et al. (2016), but with a larger receptive field: 6 (instead of 3) sequential stacks with 10 residual block layers each. We found that a deeper context stack, namely 2 stacks with 6 layers each arranged in a series, worked better for this task. We also updated the model to produce 16-bit output using a mixture of logistics as described in van den Oord et al. (2018). The resulting losses after 1M training steps were 3.72, 3.70 and 3.84, respectively. |
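
The split rules quoted in the Dataset Splits row can be illustrated with a minimal sketch. The grouping keys (`composer`, `composition`, `duration`), the 80/10/10 target proportions, and the popularity cutoff are illustrative assumptions rather than the authors' released tooling, and the sketch only balances proportions globally; the paper additionally aims to maintain them within each composer where possible.

```python
from collections import defaultdict

# Assumed target proportions; the paper asks that they hold globally
# and, where possible, within each composer.
TARGET = {"train": 0.8, "validation": 0.1, "test": 0.1}

def split_compositions(performances, popular_threshold=4):
    """Assign every performance of a composition to exactly one split."""
    # Group performances by (composer, composition) so the same piece never
    # appears in more than one subset, even when several performers played it.
    groups = defaultdict(list)
    for perf in performances:
        groups[(perf["composer"], perf["composition"])].append(perf)

    totals = {name: 0.0 for name in TARGET}
    assignment = {}
    # Handle the most-performed pieces first: extremely popular compositions
    # go straight to the training split, as the quoted rule requires.
    for key, perfs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        duration = sum(p["duration"] for p in perfs)
        if len(perfs) >= popular_threshold:  # "extremely popular": assumed cutoff
            name = "train"
        else:
            # Greedily pick the split currently furthest below its target share.
            grand_total = sum(totals.values()) + duration
            name = min(TARGET, key=lambda n: totals[n] / grand_total - TARGET[n])
        assignment[key] = name
        totals[name] += duration
    return assignment
```

On the real metadata one would also check the resulting per-composer proportions and iterate, which the paper notes is not always possible for composers with few compositions.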
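The data augmentation described in the Experiment Setup row (random 2048-event crops, transposition within a minor third, discrete time stretching) can be sketched as follows. The `(pitch, onset, offset)` note tuples and the `augment` helper are assumptions for illustration; the paper's models operate on a performance-event vocabulary rather than note tuples.

```python
import random

CROP_LEN = 2048
STRETCH_FACTORS = (0.95, 0.975, 1.0, 1.025, 1.05)

def augment(events, rng=random):
    """Apply the crop / transpose / time-stretch augmentation to one piece."""
    # Random crop of at most 2048 events.
    if len(events) > CROP_LEN:
        start = rng.randrange(len(events) - CROP_LEN + 1)
        events = events[start:start + CROP_LEN]

    # Transposition uniformly sampled between a minor third (3 semitones)
    # below and above the original piece.
    semitones = rng.randint(-3, 3)

    # Time stretch uniformly sampled from the discrete set of factors.
    stretch = rng.choice(STRETCH_FACTORS)

    return [
        (pitch + semitones, onset * stretch, offset * stretch)
        for pitch, onset, offset in events
    ]
```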
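The claim of a larger receptive field (6 sequential stacks of 10 residual block layers instead of 3) can be sanity-checked with a quick calculation, assuming kernel size 2 and dilations doubling from 1 to 512 within each stack as in the original WaveNet; the 16 kHz sample rate used for the conversion to seconds is an assumption for illustration.

```python
def receptive_field(stacks, layers_per_stack, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions."""
    # Each dilated conv layer with dilation d widens the field by (kernel_size - 1) * d.
    per_stack = sum((kernel_size - 1) * 2 ** i for i in range(layers_per_stack))
    return stacks * per_stack + 1  # +1 for the current sample

if __name__ == "__main__":
    for stacks in (3, 6):
        samples = receptive_field(stacks, layers_per_stack=10)
        print(f"{stacks} stacks -> {samples} samples "
              f"(~{samples / 16000:.3f} s at an assumed 16 kHz)")
```

Doubling the number of stacks from 3 to 6 roughly doubles the receptive field (from about 3,070 to about 6,139 samples under these assumptions), which is the sense in which the quoted setup describes a "larger receptive field."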