MT3: Multi-Task Multitrack Music Transcription

Authors: Joshua P. Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, Jesse Engel

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show this unified training framework achieves high-quality transcription results across a range of datasets, dramatically improving performance for low-resource instruments (such as guitar), while preserving strong performance for abundant instruments (such as piano). (Section 4: Experiments)
Researcher Affiliation | Collaboration | Google Research, Brain Team: Josh Gardner (also Paul G. Allen School of Computer Science & Engineering; work performed as a Google Research intern), Ian Simon, Ethan Manilow (also Interactive Audio Lab, Northwestern University; work performed as a Google Student Researcher), Curtis Hawthorne, and Jesse Engel.
Pseudocode | No | The paper describes the tokenization scheme and model architecture in prose but does not contain structured pseudocode or algorithm blocks. (A hypothetical sketch of such a tokenizer appears after the table.)
Open Source Code | Yes | Additionally, we make our code available along with the release of this paper at https://github.com/magenta/mt3. In conjunction with the release of this work, we will make our model code, along with the code we used to replicate prior baseline models, available at https://github.com/magenta/mt3.
Open Datasets | Yes | MAESTROv3 (Hawthorne et al., 2018) is collected from a virtual classical piano competition, where audio and detailed MIDI data are collected from performers playing on Disklavier pianos that electronically capture the performance of each note in real time. ... More detailed statistics on the MAESTRO dataset are available at https://magenta.tensorflow.org/datasets/maestro. The Synthesized Lakh Dataset (Slakh, or Slakh2100) (Manilow et al., 2019) is a dataset constructed by creating high-quality renderings of 2100 files from Lakh MIDI using professional-quality virtual instruments. GuitarSet (Xi et al., 2018) is composed of live guitar performances of varied genre, tempo, and style, recorded using a high-precision hexaphonic pickup that individually captures the sound of each guitar string. ... https://guitarset.weebly.com/ MusicNet (Thickstun et al., 2016) consists of 330 recordings of classical music with MIDI annotations. ... https://homes.cs.washington.edu/~thickstn/musicnet.html The University of Rochester Multi-Modal Music Performance (URMP) dataset (Li et al., 2018) consists of audio, video, and MIDI annotation of multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks. ... http://www2.ece.rochester.edu/projects/air/projects/URMP.html
Dataset Splits | Yes | MAESTRO includes a standard train/validation/test split, which ensures that the same composition does not appear in multiple subsets: 962 performances are in the train set, 137 in the validation set, and 177 in the test set. We use the standard Slakh2100 train/validation/test splits for all experiments, including to separate Cerberus4 tracks: the training set contains 960 tracks with 418.13 hours of audio, the test set 132 tracks with 46.1 hours, and the validation set 235 tracks with 78.4 hours. There is no official train/test split for GuitarSet, so we establish the following split: for every style, we use the first two progressions for train and the final for validation. For convenience, we provide the exact train/validation split for tracks as part of the open-source release for this paper. This split produces 478 tracks for training and 238 for validation. (A sketch of this split rule appears after the table.) For MusicNet, we perform our own random split of the dataset into train/validation/test sets, and we provide the exact track IDs in each split in our open-source code release for this paper. For URMP, we use pieces 1, 2, 12, 13, 24, 25, 31, 38, and 39 for validation; the remaining pieces are used for training. This reserves two duets, two trios, three quartets, and two quintets for the validation set, ensuring diverse instrumentation in the validation split.
Hardware Specification | No | The paper mentions the model size (T5 small, 60 million parameters) but does not provide specific hardware details such as exact GPU/CPU models or the types of machines used for training or inference.
Software Dependencies | Yes | We use the T5 small model architecture described in Raffel et al. (2019), with the modifications defined in the T5.1.1 recipe. This is a standard Transformer architecture, and we use the implementation available in t5x, which is built on FLAX (Heek et al., 2020) and JAX (Bradbury et al., 2020). The Heek et al. (2020) citation lists FLAX as version 0.3. (A toy FLAX/JAX example appears after the table.)
Experiment Setup | Yes | All mixture models are trained for 1M steps using a fixed learning rate of 0.001. The dataset-specific models are trained for 2^19 (524,288) steps, as these models tended to converge much faster, particularly on the smaller datasets. (A sketch of this configuration appears after the table.)
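
For readers who want a concrete picture of the tokenization the Pseudocode row refers to, the sketch below is a hypothetical, simplified reconstruction of an MT3-style event vocabulary: notes are serialized as program, note-on/off, and absolute-time events within a fixed-length audio segment. The event names, bin sizes, and the `tokenize_notes` helper are illustrative assumptions, not the actual magenta/mt3 implementation.

```python
# Hypothetical sketch of an MT3-style event tokenization (NOT the actual
# magenta/mt3 code). Notes become (time, program, note-on/off) events within
# a segment, matching the paper's description of its MIDI-like vocabulary.
from __future__ import annotations
from dataclasses import dataclass

TIME_BINS = 205   # assumed: ~10 ms time resolution per audio segment

@dataclass
class Note:
    onset: float   # seconds, relative to segment start
    offset: float
    pitch: int     # MIDI pitch (0-127)
    program: int   # MIDI program number (0-127)

def time_token(t: float, resolution: float = 0.01) -> int:
    """Quantize an absolute segment time into a time-token index."""
    return min(int(round(t / resolution)), TIME_BINS - 1)

def tokenize_notes(notes: list[Note]) -> list[tuple[str, int]]:
    """Serialize notes into a flat, time-ordered event sequence."""
    events = []
    for n in notes:
        events.append((n.onset, ("program", n.program)))
        events.append((n.onset, ("note_on", n.pitch)))
        events.append((n.offset, ("note_off", n.pitch)))
    tokens, last_t = [], None
    for t, ev in sorted(events, key=lambda e: e[0]):
        tt = time_token(t)
        if tt != last_t:            # emit a time token only when time advances
            tokens.append(("time", tt))
            last_t = tt
        tokens.append(ev)
    return tokens

# Example: a single guitar note (program 24) sounding from 0.10 s to 0.52 s.
print(tokenize_notes([Note(0.10, 0.52, pitch=64, program=24)]))
# [('time', 10), ('program', 24), ('note_on', 64), ('time', 52), ('note_off', 64)]
```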
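The GuitarSet rule quoted in the Dataset Splits row ("for every style, we use the first two progressions for train and the final for validation") can be expressed in a few lines. The filename pattern assumed below follows GuitarSet's player/style/progression naming scheme; the exact parsing in the MT3 release may differ.

```python
# Hypothetical reconstruction of the GuitarSet train/validation split rule:
# for every style, progressions 1 and 2 go to train, progression 3 to
# validation. Assumes GuitarSet's "<player>_<style><progression>-<tempo>-..."
# track naming; this may not match the MT3 release exactly.
import re

def split_for(track_id: str) -> str:
    """Return 'train' or 'validation' for a GuitarSet track ID."""
    # e.g. "00_BN1-129-Eb_comp" -> style "BN", progression 1
    m = re.match(r"\d+_([A-Za-z]+)(\d)-", track_id)
    if m is None:
        raise ValueError(f"unrecognized GuitarSet track ID: {track_id}")
    progression = int(m.group(2))
    return "train" if progression in (1, 2) else "validation"

assert split_for("00_BN1-129-Eb_comp") == "train"
assert split_for("05_Rock3-148-C_solo") == "validation"
```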
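The Software Dependencies row names the t5x/FLAX/JAX stack. The fragment below is only a toy illustration of that stack (defining, initializing, and applying a single FLAX module under JAX); it is not the t5x T5.1.1 model the paper uses, and the module and its dimensions are invented for illustration.

```python
# Toy illustration of the FLAX/JAX stack named in the Software Dependencies
# row. NOT the t5x T5.1.1 model from the paper; dimensions are invented.
import jax
import jax.numpy as jnp
from flax import linen as nn

class TinyFeedForward(nn.Module):
    """A single Transformer-style feed-forward block (for illustration)."""
    d_model: int = 512
    d_ff: int = 1024

    @nn.compact
    def __call__(self, x):
        h = nn.Dense(self.d_ff)(x)
        h = nn.gelu(h)
        return nn.Dense(self.d_model)(h)

module = TinyFeedForward()
x = jnp.ones((2, 16, 512))                      # (batch, length, d_model)
params = module.init(jax.random.PRNGKey(0), x)  # initialize parameters
y = jax.jit(module.apply)(params, x)            # jit-compiled forward pass
print(y.shape)                                  # (2, 16, 512)
```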
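Finally, the training configuration quoted in the Experiment Setup row (fixed learning rate of 0.001; 1M steps for mixture models, 2^19 steps for dataset-specific models) could be written as an optax configuration like the sketch below. Only the constant learning rate and the step counts come from the paper; the choice of Adafactor is an assumption based on common T5 practice.

```python
# Hedged sketch of the quoted training setup. The constant 1e-3 learning
# rate and step counts are from the paper; the Adafactor optimizer is an
# assumption (the T5 default), not a detail stated in the quoted text.
import optax

MIXTURE_STEPS = 1_000_000        # mixture models: 1M steps
SINGLE_DATASET_STEPS = 2 ** 19   # dataset-specific models: 524,288 steps

learning_rate = optax.constant_schedule(1e-3)             # fixed LR of 0.001
optimizer = optax.adafactor(learning_rate=learning_rate)  # assumed optimizer
```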