Tuning Recurrent Neural Networks with Reinforcement Learning
Authors: Natasha Jaques, Shixiang Gu, Richard E. Turner, Douglas Eck
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore the usefulness of our approach in the context of music generation. An LSTM is trained on a large corpus of songs to predict the next note in a musical sequence. This Note-RNN is then refined using our method and rules of music theory. We show that by combining maximum likelihood (ML) and RL in this way, we can not only produce more pleasing melodies, but significantly reduce unwanted behaviors and failure modes of the RNN, while maintaining information learned from data. Table 1 provides quantitative results in the form of performance on the music theory rules to which we trained the model to adhere... To answer it, we conducted a user study via Amazon Mechanical Turk in which participants were asked to rate which of two randomly selected melodies they preferred on a Likert scale. (A hedged sketch of this ML+RL reward combination appears after this table.) |
| Researcher Affiliation | Collaboration | Natasha Jaques (Google Brain, USA; Massachusetts Institute of Technology, USA), Shixiang Gu (Google Brain, USA; University of Cambridge, UK; Max Planck Institute for Intelligent Systems, Germany), Richard E. Turner (University of Cambridge, UK), Douglas Eck (Google Brain, USA) |
| Pseudocode | No | The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | All of the code for RL Tuner, including a checkpointed version of the trained Note RNN is available at https://github.com/natashamjaques/magenta/tree/rl-tuner. |
| Open Datasets | No | To train the Note RNN, we extract monophonic melodies from a corpus of 30,000 MIDI songs. No specific access information (link, DOI, formal citation with authors/year) is provided for this dataset. |
| Dataset Splits | No | The paper mentions 'validation accuracy' but does not provide specific details on how the dataset was split into training, validation, and test sets (e.g., percentages, sample counts, or a clear splitting methodology). |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions the Adam optimizer but does not specify any software dependencies (libraries, frameworks) with version numbers that would be required to reproduce the experiments. |
| Experiment Setup | Yes | The Note RNN consists of one LSTM layer of 100 cells, and was trained for 30,000 iterations with a batch size of 128. Optimization was performed with Adam (Kingma & Ba, 2014), and gradients were clipped to ensure the L2 norm was less than 5. The learning rate was initially set to .5, and a momentum of 0.85 was used to exponentially decay the learning rate every 1000 steps. To regularize the network, a penalty of β = 2.5 × 10⁻⁵ was applied to the L2 norm of the network weights. Finally, the losses for the first 8 notes of each sequence were not used to train the model... Each RL Tuner model was trained for 1,000,000 iterations, using the Adam optimizer, a batch size of 32, and clipping gradients in the same way. The reward discount factor was γ = .5. The Target-Q-network's weights θ⁻ were gradually updated to be similar to those of the Q-network (θ) according to the formula (1 − η)θ⁻ + ηθ, where η = .01 is the Target-Q-network update rate. We replicated our results for a number of settings for the weight placed on the music-theory rewards, c; we present results for c = .5 below because we believe them to be most musically pleasing. Similarly, we replicated the results using both ϵ-greedy and Boltzmann exploration, and present the results using ϵ-greedy exploration below. (A minimal configuration sketch based on these values appears after this table.) |
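
The Research Type row above quotes the paper's core approach: a Note RNN trained by maximum likelihood is refined with RL, using a reward that mixes what the network learned from data with hand-coded music-theory rules. The sketch below illustrates one way such a combined reward can be expressed. The function names and the exact weighting by the music-theory coefficient c are illustrative assumptions, not the released RL Tuner implementation.

```python
import numpy as np

# Weight on the music-theory term; the paper reports presenting results for c = .5.
# How c enters the combined reward (multiplied vs. divided) is an assumption here
# and should be checked against the paper's equations and released code.
C = 0.5


def note_rnn_log_prob(state, action):
    """Placeholder for log p(action | state) from the trained Note RNN (assumed interface)."""
    # In RL Tuner this would come from the checkpointed Note RNN's output distribution.
    return float(np.log(0.05))  # dummy value for illustration


def music_theory_reward(state, action):
    """Placeholder for the rule-based music-theory reward r_MT(state, action) (assumed interface)."""
    return 1.0  # dummy value for illustration


def combined_reward(state, action, c=C):
    """Data-derived term plus rule-based term, with the rule term scaled by c (assumed form)."""
    return note_rnn_log_prob(state, action) + c * music_theory_reward(state, action)


if __name__ == "__main__":
    print(combined_reward(state=None, action=0))
```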
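
The Experiment Setup row lists concrete hyperparameters, including a soft update of the Target-Q-network weights, θ⁻ ← (1 − η)θ⁻ + ηθ with η = .01. Below is a minimal sketch that collects the quoted RL Tuner settings into a configuration dictionary and implements that soft update with NumPy; the variable names and weight shapes are illustrative and are not taken from the released code.

```python
import numpy as np

# RL Tuner hyperparameters as quoted in the Experiment Setup row.
RL_TUNER_CONFIG = {
    "iterations": 1_000_000,
    "batch_size": 32,
    "optimizer": "adam",
    "gradient_clip_l2_norm": 5.0,     # same clipping as the Note RNN training
    "discount_gamma": 0.5,
    "target_update_rate_eta": 0.01,
    "music_theory_weight_c": 0.5,
    "exploration": "epsilon-greedy",
}


def soft_update_target(theta_target, theta, eta):
    """Blend Target-Q-network weights toward Q-network weights: (1 - eta) * θ⁻ + eta * θ."""
    return [(1.0 - eta) * w_t + eta * w for w_t, w in zip(theta_target, theta)]


# Example with illustrative weight shapes (not the actual RL Tuner parameter layout).
theta = [np.random.randn(100, 64), np.random.randn(64)]
theta_target = [np.zeros_like(w) for w in theta]
theta_target = soft_update_target(
    theta_target, theta, RL_TUNER_CONFIG["target_update_rate_eta"]
)
print(theta_target[1][:3])  # target weights have moved 1% of the way toward θ
```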