Encoding Musical Style with Transformer Autoencoders

Authors: Kristy Choi, Curtis Hawthorne, Ian Simon, Monica Dinculescu, Jesse Engel

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines. We evaluate our model on two datasets: the publicly-available MAESTRO (Hawthorne et al., 2019) dataset, and a YouTube dataset of piano performances transcribed from 10,000+ hours of audio (Simon et al., 2019). We validate this notion of perceptual similarity through quantitative analyses based on note-based features of performances as well as qualitative user listening studies and interpolations. As shown in Tables 3 and 4, the performance autoencoder generates samples that have 48% higher similarity to the conditioning input as compared to the unconditional baseline for the YouTube dataset (45% higher similarity for MAESTRO).
Researcher Affiliation | Collaboration | Kristy Choi (1*), Curtis Hawthorne (2), Ian Simon (2), Monica Dinculescu (2), Jesse Engel (2). (1) Department of Computer Science, Stanford University; (2) Google Brain. *Work completed during an internship at Google Brain.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks; it includes a flowchart in Figure 1 and describes its processes in paragraph text. (A hypothetical sketch of the described process appears after this table.)
Open Source Code | Yes | We provide open-sourced implementations in Tensorflow (Abadi et al., 2016) at https://goo.gl/magenta/music-transformer-autoencoder-code.
Open Datasets | Yes | Empirically, we demonstrate the effectiveness of our method on various music generation tasks on the MAESTRO dataset and a YouTube dataset with 10,000+ hours of piano performances, where we achieve improvements in terms of log-likelihood and mean listening scores as compared to baselines. Empirically, we evaluate our model on two datasets: the publicly-available MAESTRO (Hawthorne et al., 2019) dataset, and a YouTube dataset of piano performances transcribed from 10,000+ hours of audio (Simon et al., 2019).
Dataset Splits | Yes | Datasets: We used both the MAESTRO (Hawthorne et al., 2019) and YouTube datasets (Simon et al., 2019) for the experimental setup. We used the standard 80/10/10 train/validation/test split from MAESTRO v1.0.0, and augmented the dataset by 10x using pitch shifts of no more than a minor third and time stretches of at most 5%. (An augmentation sketch appears after this table.)
Hardware Specification | No | The paper mentions "GPU training" and "TPU training" but does not provide specific hardware details such as GPU/CPU models, memory amounts, or other machine specifications.
Software Dependencies | No | We implemented the model in the Tensor2Tensor framework (Vaswani et al., 2017), and used the default hyperparameters for training: 0.2 learning rate with 8000 warmup steps, rsqrt decay, 0.2 dropout, and early stopping for GPU training. We provide open-sourced implementations in Tensorflow (Abadi et al., 2016) at https://goo.gl/magenta/music-transformer-autoencoder-code. The paper names Tensor2Tensor and TensorFlow but does not specify their version numbers.
Experiment Setup | Yes | We implemented the model in the Tensor2Tensor framework (Vaswani et al., 2017), and used the default hyperparameters for training: 0.2 learning rate with 8000 warmup steps, rsqrt decay, 0.2 dropout, and early stopping for GPU training. For TPU training, we use Adafactor with rsqrt decay and 10K learning rate warmup steps. We adopt many of the hyperparameter configurations from Huang et al. (2019b): we reduce the query and key hidden size to half the hidden size, use 8 hidden layers, use 384 hidden units, and set the maximum relative distance to consider to half the training sequence length for relative global attention. We set the maximum sequence length (length of event-based representations) to 2048 tokens, and use a filter size of 1024. (A configuration and learning-rate-schedule sketch appears after this table.)
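
Since the paper offers no pseudocode, a minimal sketch of the process it describes (a performance encoder whose outputs are mean-aggregated into a single style vector that conditions an autoregressive decoder, optionally alongside a melody embedding) might look like the following. All function names, and the exact way the conditioning signals are combined, are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def transformer_autoencoder_step(perf_tokens, melody_tokens, target_tokens,
                                 perf_encoder, melody_encoder, decoder):
    # Encode the conditioning performance into per-token states,
    # shape (seq_len, hidden_size). `perf_encoder`, `melody_encoder`,
    # and `decoder` are stand-ins for Transformer components, not real APIs.
    perf_states = perf_encoder(perf_tokens)

    # Aggregate across time into a single style vector, shape (hidden_size,).
    style_vector = perf_states.mean(axis=0)

    if melody_tokens is not None:
        # Melody & performance variant: encode the melody and combine it
        # with the style vector (the combination details are assumed here).
        melody_states = melody_encoder(melody_tokens)
        tiled_style = np.tile(style_vector, (melody_states.shape[0], 1))
        conditioning = np.concatenate([melody_states, tiled_style], axis=-1)
    else:
        conditioning = style_vector

    # Decode autoregressively, conditioned on the aggregated representation;
    # training maximizes the log-likelihood of the target performance.
    return decoder(target_tokens, conditioning)
```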
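The reported augmentation (10x, pitch shifts of no more than a minor third, time stretches of at most 5%) can be sketched as below; the note-tuple representation here is an assumption, as the paper actually operates on event-based MIDI sequences.

```python
import random

def augment_performance(notes, max_semitones=3, max_stretch=0.05):
    # `notes` is assumed to be a list of (pitch, start, end, velocity)
    # tuples; the paper's actual event-based representation differs.
    shift = random.randint(-max_semitones, max_semitones)      # minor third = 3 semitones
    stretch = 1.0 + random.uniform(-max_stretch, max_stretch)  # at most 5%
    return [(pitch + shift, start * stretch, end * stretch, velocity)
            for pitch, start, end, velocity in notes]

# To augment a corpus 10x, draw ten independent (shift, stretch) pairs
# per performance:
# augmented = [augment_performance(notes) for _ in range(10)]
```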
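The quoted setup maps naturally onto a Tensor2Tensor-style configuration. The field names below follow common T2T conventions but are illustrative rather than the authors' exact config, and the schedule shows one common form of rsqrt decay with linear warmup; the exact scaling in the authors' runs may differ.

```python
import math

# Values quoted in the paper, arranged in a Tensor2Tensor-style dict
# (field names illustrative, not the authors' exact configuration).
hparams = dict(
    num_hidden_layers=8,
    hidden_size=384,
    filter_size=1024,
    attention_key_channels=192,   # query/key size = half the hidden size
    max_length=2048,              # max event-token sequence length
    max_relative_position=1024,   # half the training sequence length
    dropout=0.2,
    learning_rate=0.2,
    learning_rate_warmup_steps=8000,
)

def rsqrt_learning_rate(step, base_lr=0.2, warmup_steps=8000):
    """One common T2T form ("linear_warmup * rsqrt_decay")."""
    step = max(step, 1)
    warmup = min(1.0, step / warmup_steps)            # ramp up linearly
    decay = 1.0 / math.sqrt(max(step, warmup_steps))  # then decay ~ 1/sqrt(step)
    return base_lr * warmup * decay
```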