Unaligned Supervision for Automatic Music Transcription in The Wild
Authors: Ben Maman, Amit H Bermano
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using this unaligned supervision scheme, complemented by pseudo-labels and pitch-shift augmentation, our method enables training on in-the-wild recordings with unprecedented accuracy and instrumental variety. Using only synthetic data and unaligned supervision, we report SOTA note-level accuracy on the MAPS dataset, and large favorable margins on cross-dataset evaluations. We also demonstrate robustness and ease of use; we report comparable results when training on a small, easily obtainable, self-collected dataset, and we propose alternative labeling for the MusicNet dataset, which we show to be more accurate. |
| Researcher Affiliation | Academia | Ben Maman, Amit H. Bermano; School of Computer Science, Tel Aviv University, Israel. Correspondence to: Ben Maman <benmaman@mail.tau.ac.il>, Amit H. Bermano <amberman@tauex.tau.ac.il>. |
| Pseudocode | Yes | Our method, described in pseudo-code in Algorithm 1, relies on Expectation Maximization (EM) (see Section 3.1) and involves three components (see Figure 1, left): (I) initial training on synthetic data (Section 3.2); (II) aligning real recordings with separate-source MIDI (Section B.1.1), including deciding which frames to use and which to discard (Section 3.3); and (III) transcriber refinement, including pitch-shift equivariance augmentations (Section 3.4). A hedged sketch of this EM loop appears after the table. |
| Open Source Code | Yes | Our improved annotation for MusicNet, our code, and qualitative examples for various genres and instruments are available on our project page at https://benadar293.github.io. |
| Open Datasets | Yes | MIDI Pop Dataset (AI, 2020) is a large collection of MIDI files. MusicNet (Thickstun et al., 2017) contains 34 hours of western classical music, performed on various instruments. |
| Dataset Splits | Yes | We measure the accuracy of our labeling process on the Maestro validation dataset, for which precise annotation exists. For 46 of the 105 pieces in the validation dataset (total duration 6:57:22), we were able to find additional unaligned MIDI (to be used instead of those offered with the dataset). |
| Hardware Specification | Yes | This took 65 hours on a pair of Nvidia GeForce RTX 2080 Ti GPUs. For most experiments, labeling is performed twice: once after synthetic training, and once after 45·\|Dataset\| steps. For perspective, MusicNet EM training, which includes 28K iterations and 2 DTW labeling iterations, took 16 hours on a pair of Nvidia GeForce RTX 2080 Ti GPUs. A minimal sketch of the DTW labeling step appears after the table. |
| Software Dependencies | No | No specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, or specific libraries) were provided; only general tools are named, such as the Adam optimizer and a mean BCE loss. |
| Experiment Setup | Yes | For all our experiments, we use an architecture similar to the one proposed by Hawthorne et al. (2019). To handle instrument variety, we increase network width compared to the originally proposed architecture: we use LSTM layers of size 384, convolutional filters of size 64/64/128, and linear layers of size 1024. We resampled all recordings to a 16 kHz sample rate and used the log-mel spectrogram as the input representation, with 229 log-spaced bins (i.e., input dimensionality of 229). We used the mean BCE loss with an Adam optimizer, gradient clipping to norm 3, and batch size 8. The initial synthetic model was trained for 350K steps. Further training on real data was done for 90·\|Dataset\| steps. A hedged sketch of this configuration appears after the table. |
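
The three components quoted in the Pseudocode row (synthetic pretraining, DTW-based labeling of real recordings, transcriber refinement) form an EM loop. The skeleton below is a hedged sketch under that reading, not the authors' implementation: `pretrain_synthetic`, `dtw_align`, and `refine` are hypothetical callables standing in for the respective stages, and the two labeling rounds mirror the "2 DTW labeling iterations" reported in the Hardware row.

```python
# Hypothetical skeleton of the three-component EM scheme; all helper callables
# are stand-ins for the paper's stages, not its actual API.
from typing import Callable, Sequence

def em_train(
    model,                              # transcriber network (Onsets & Frames-style)
    pretrain_synthetic: Callable,       # (I) initial training on synthetic MIDI-rendered data
    dtw_align: Callable,                # (II) align a real recording with its unaligned MIDI
    refine: Callable,                   # (III) retrain on pseudo-labels (with pitch-shift aug.)
    real_recordings: Sequence,          # (audio, unaligned_midi) pairs from the wild
    labeling_rounds: int = 2,           # the paper reports 2 DTW labeling iterations
):
    """E-step: predict and align to produce labels; M-step: retrain the transcriber."""
    pretrain_synthetic(model)
    for _ in range(labeling_rounds):
        # E-step: current model predictions drive the alignment/labeling
        labels = [dtw_align(model, audio, midi) for audio, midi in real_recordings]
        # M-step: refine the transcriber on the newly aligned labels
        refine(model, real_recordings, labels)
    return model
```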
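
The labeling step warps each separate-source MIDI onto its real recording. A minimal sketch of the idea is below, assuming (this is not the paper's code) that the MIDI piano roll is aligned against the transcriber's frame-level predictions with librosa's DTW; the hop length, piano-key range, and Euclidean frame distance are assumptions, and the paper's procedure additionally decides which frames to keep, which this sketch omits.

```python
# Minimal DTW-labeling sketch (assumptions marked): warp unaligned MIDI note
# times onto a real recording by aligning the MIDI piano roll with the model's
# frame-level note-activation predictions.
import numpy as np
import pretty_midi
import librosa

FPS = 16000 / 512        # frames per second at 16 kHz; hop length 512 is an assumption

def dtw_label(pred, midi_path):
    """pred: (88, frames) matrix of frame-level note probabilities for the recording."""
    midi = pretty_midi.PrettyMIDI(midi_path)
    roll = (midi.get_piano_roll(fs=FPS)[21:109] > 0).astype(float)  # 88 piano keys
    _, wp = librosa.sequence.dtw(X=roll, Y=pred)   # default Euclidean frame distance
    wp = wp[::-1]                                  # librosa returns the path end-to-start

    def warp(t):                                   # MIDI time (s) -> recording time (s)
        i = min(np.searchsorted(wp[:, 0], int(t * FPS)), len(wp) - 1)
        return wp[i, 1] / FPS

    # aligned (onset, offset, pitch) pseudo-labels for retraining
    return [(warp(n.start), warp(n.end), n.pitch)
            for inst in midi.instruments for n in inst.notes]
```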
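
The Experiment Setup row pins down most of the acoustic-model hyperparameters. The PyTorch sketch below wires them together (conv filters 64/64/128, linear layers of size 1024, LSTM size 384, 229 mel bins, batch size 8, mean BCE loss, Adam, gradient clipping to norm 3). Layer ordering, pooling, bidirectionality, the single-instrument output head, and the learning rate are assumptions borrowed from the common Onsets & Frames recipe, not taken from the paper.

```python
# Hedged sketch of the widened Onsets & Frames-style stack described in the
# Experiment Setup row; structure beyond the quoted sizes is an assumption.
import torch
import torch.nn as nn

N_MELS, N_KEYS = 229, 88     # 229 log-mel bins; 88 keys (single-instrument head for brevity)

class AcousticStack(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        self.fc = nn.Linear(128 * (N_MELS // 4), 1024)            # linear layers of size 1024
        self.lstm = nn.LSTM(1024, 384, batch_first=True,
                            bidirectional=True)                   # LSTM size 384 (bidir. assumed)
        self.out = nn.Linear(2 * 384, N_KEYS)

    def forward(self, mel):                    # mel: (batch, frames, 229)
        x = self.conv(mel.unsqueeze(1))        # -> (batch, 128, frames, 57)
        x = x.permute(0, 2, 1, 3).flatten(2)   # -> (batch, frames, 128 * 57)
        x = torch.relu(self.fc(x))
        x, _ = self.lstm(x)
        return self.out(x)                     # per-frame key logits

model = AcousticStack()
opt = torch.optim.Adam(model.parameters(), lr=6e-4)        # lr is an assumption
mel = torch.randn(8, 100, N_MELS)                          # batch size 8, as reported
target = torch.randint(0, 2, (8, 100, N_KEYS)).float()
loss = nn.BCEWithLogitsLoss()(model(mel), target)          # mean BCE loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 3.0)    # clip gradients to norm 3
opt.step()
```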