Encoding Musical Style with Transformer Autoencoders

Authors: Kristy Choi, Curtis Hawthorne, Ian Simon, Monica Dinculescu, Jesse Engel

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines. We evaluate our model on two datasets: the publicly-available MAESTRO (Hawthorne et al., 2019) dataset, and a YouTube dataset of piano performances transcribed from 10,000+ hours of audio (Simon et al., 2019). We validate this notion of perceptual similarity through quantitative analyses based on note-based features of performances as well as qualitative user listening studies and interpolations. As shown in Tables 3 and 4, the performance autoencoder generates samples that have 48% higher similarity to the conditioning input as compared to the unconditional baseline for the YouTube dataset (45% higher similarity for MAESTRO).
Researcher Affiliation | Collaboration | Kristy Choi (1*), Curtis Hawthorne (2), Ian Simon (2), Monica Dinculescu (2), Jesse Engel (2). (1) Department of Computer Science, Stanford University; (2) Google Brain. *Work completed during an internship at Google Brain.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; it includes a flowchart in Figure 1 and describes its processes in paragraph text. (A hypothetical sketch of the described process appears after this table.)
Open Source Code | Yes | We provide open-sourced implementations in Tensorflow (Abadi et al., 2016) at https://goo.gl/magenta/music-transformer-autoencoder-code.
Open Datasets | Yes | Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines. Empirically, we evaluate our model on two datasets: the publicly-available MAESTRO (Hawthorne et al., 2019) dataset, and a YouTube dataset of piano performances transcribed from 10,000+ hours of audio (Simon et al., 2019).
Dataset Splits | Yes | Datasets: We used both the MAESTRO (Hawthorne et al., 2019) and YouTube datasets (Simon et al., 2019) for the experimental setup. We used the standard 80/10/10 train/validation/test split from MAESTRO v1.0.0, and augmented the dataset by 10x using pitch shifts of no more than a minor third and time stretches of at most 5%. (An augmentation sketch appears after this table.)
Hardware Specification | No | The paper mentions "GPU training" and "TPU training" but does not provide specific hardware details such as GPU/CPU models, memory amounts, or other machine specifications.
Software Dependencies | No | We implemented the model in the Tensor2Tensor framework (Vaswani et al., 2017), and used the default hyperparameters for training: 0.2 learning rate with 8000 warmup steps, rsqrt decay, 0.2 dropout, and early stopping for GPU training. We provide open-sourced implementations in Tensorflow (Abadi et al., 2016) at https://goo.gl/magenta/music-transformer-autoencoder-code. The paper names Tensor2Tensor and TensorFlow but does not specify their version numbers.
Experiment Setup | Yes | We implemented the model in the Tensor2Tensor framework (Vaswani et al., 2017), and used the default hyperparameters for training: 0.2 learning rate with 8000 warmup steps, rsqrt decay, 0.2 dropout, and early stopping for GPU training. For TPU training, we use Adafactor with rsqrt decay and 10K learning rate warmup steps. We adopt many of the hyperparameter configurations from Huang et al. (2019b): we reduce the query and key hidden size to half the hidden size, use 8 hidden layers, use 384 hidden units, and set the maximum relative distance to consider to half the training sequence length for relative global attention. We set the maximum sequence length (length of event-based representations) to 2048 tokens, and use a filter size of 1024. (A configuration and learning-rate-schedule sketch appears after this table.)
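
Since the paper offers no pseudocode, a minimal sketch of the process it describes (a performance encoder whose outputs are mean-aggregated into a single style vector that conditions an autoregressive decoder, optionally alongside a melody embedding) might look like the following. All function names, and the exact way the conditioning signals are combined, are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def transformer_autoencoder_step(perf_tokens, melody_tokens, target_tokens,
                                 perf_encoder, melody_encoder, decoder):
    # Encode the conditioning performance into per-token states,
    # shape (seq_len, hidden_size). `perf_encoder`, `melody_encoder`,
    # and `decoder` are stand-ins for Transformer components, not real APIs.
    perf_states = perf_encoder(perf_tokens)

    # Aggregate across time into a single style vector, shape (hidden_size,).
    style_vector = perf_states.mean(axis=0)

    if melody_tokens is not None:
        # Melody & performance variant: encode the melody and combine it
        # with the style vector (the combination details are assumed here).
        melody_states = melody_encoder(melody_tokens)
        tiled_style = np.tile(style_vector, (melody_states.shape[0], 1))
        conditioning = np.concatenate([melody_states, tiled_style], axis=-1)
    else:
        conditioning = style_vector

    # Decode autoregressively, conditioned on the aggregated representation;
    # training maximizes the log-likelihood of the target performance.
    return decoder(target_tokens, conditioning)
```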
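The reported augmentation (10x, pitch shifts of no more than a minor third, time stretches of at most 5%) can be sketched as below; the note-tuple representation here is an assumption, as the paper actually operates on event-based MIDI sequences.

```python
import random

def augment_performance(notes, max_semitones=3, max_stretch=0.05):
    # `notes` is assumed to be a list of (pitch, start, end, velocity)
    # tuples; the paper's actual event-based representation differs.
    shift = random.randint(-max_semitones, max_semitones)      # minor third = 3 semitones
    stretch = 1.0 + random.uniform(-max_stretch, max_stretch)  # at most 5%
    return [(pitch + shift, start * stretch, end * stretch, velocity)
            for pitch, start, end, velocity in notes]

# To augment a corpus 10x, draw ten independent (shift, stretch) pairs
# per performance:
# augmented = [augment_performance(notes) for _ in range(10)]
```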
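The quoted setup maps naturally onto a Tensor2Tensor-style configuration. The field names below follow common T2T conventions but are illustrative rather than the authors' exact config, and the schedule shows one common form of rsqrt decay with linear warmup; the exact scaling in the authors' runs may differ.

```python
import math

# Values quoted in the paper, arranged in a Tensor2Tensor-style dict
# (field names illustrative, not the authors' exact configuration).
hparams = dict(
    num_hidden_layers=8,
    hidden_size=384,
    filter_size=1024,
    attention_key_channels=192,   # query/key size = half the hidden size
    max_length=2048,              # max event-token sequence length
    max_relative_position=1024,   # half the training sequence length
    dropout=0.2,
    learning_rate=0.2,
    learning_rate_warmup_steps=8000,
)

def rsqrt_learning_rate(step, base_lr=0.2, warmup_steps=8000):
    """One common T2T form ("linear_warmup * rsqrt_decay")."""
    step = max(step, 1)
    warmup = min(1.0, step / warmup_steps)            # ramp up linearly
    decay = 1.0 / math.sqrt(max(step, warmup_steps))  # then decay ~ 1/sqrt(step)
    return base_lr * warmup * decay
```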