Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control
Authors: Natasha Jaques, Shixiang Gu, Dzmitry Bahdanau, José Miguel Hernández-Lobato, Richard E. Turner, Douglas Eck
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of the approach is demonstrated on two applications: 1) generating novel musical melodies, and 2) computational molecular generation. For both problems, we show that the proposed method improves the desired properties and structure of the generated sequences, while maintaining information learned from data. |
| Researcher Affiliation | Collaboration | 1Google Brain, Mountain View, USA 2Massachusetts Institute of Technology, Cambridge, USA 3University of Cambridge, Cambridge, UK 4Max Planck Institute for Intelligent Systems, Stuttgart, Germany 5Université de Montréal, Montréal, Canada. |
| Pseudocode | No | Not found. |
| Open Source Code | Yes | The code for Sequence Tutor, including a checkpointed version of the trained melody RNN, is available at [redacted for anonymous submission]. |
| Open Datasets | Yes | To train the model, we begin by extracting monophonic melodies from a corpus of 30,000 MIDI songs and encoding them as one-hot sequences of notes. More information about both the note encoding and the reward metrics is available in the supplementary material. (A generic one-hot note encoding is sketched below the table.) |
| Dataset Splits | No | The trained RNN eventually obtained a validation accuracy of 92% and a log perplexity score of .2536. |
| Hardware Specification | No | Not found. |
| Software Dependencies | No | Optimization was performed with Adam (Kingma & Ba, 2014)... To optimize for these metrics... we constructed a reward function that incentivizes validity, log P, SA, and QED using an open-source library called RDkit (http://www.rdkit.org/). (A hedged RDKit reward sketch follows the table.) |
| Experiment Setup | Yes | Optimization was performed with Adam (Kingma & Ba, 2014), a batch size of 128, an initial learning rate of .5, and a stepwise learning rate decay of 0.85 every 1000 steps. Gradients were clipped to ensure the L2 norm was less than 5, and weight regularization was applied with β = 2.5 × 10⁻⁵. The Sequence Tutor model was trained using a similar configuration to the one above, except with a batch size of 32 and a reward discount factor of γ = .5. The Target-Q-network's weights θ⁻ were gradually updated towards those of the Q-network (θ) according to the formula (1 − η)θ⁻ + ηθ, where η = .01 is the Target-Q-network update rate. For this experiment, we also made use of prioritized experience replay (Schaul et al., 2015) to allow the model to more frequently learn from relatively rare valid samples. A value of c = 2.85 led to a higher yield of valid molecules with high metrics, but still encouraged the diversity of generated samples. (These hyperparameters are restated in the hyperparameter sketch below the table.) |
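The open-datasets row describes encoding extracted monophonic melodies as one-hot sequences of notes. The snippet below is a minimal, generic illustration of that encoding; the function name and the vocabulary size are assumptions, since the paper's exact note encoding is only specified in its supplementary material.

```python
import numpy as np

def one_hot_melody(note_ids, vocab_size):
    """Encode a melody (a sequence of integer note indices) as one-hot vectors.

    `vocab_size` is the size of the note vocabulary; this is an assumption here,
    as the paper's precise encoding is described in its supplementary material.
    """
    seq = np.zeros((len(note_ids), vocab_size), dtype=np.float32)
    seq[np.arange(len(note_ids)), note_ids] = 1.0
    return seq
```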
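The software-dependencies row mentions a reward built from validity, logP, SA, and QED computed with RDKit. The sketch below shows how such a reward term could be assembled from a generated SMILES string; the combination weights, the penalty for invalid molecules, and the handling of the SA score (RDKit ships `sascorer.py` under its Contrib directory, omitted here) are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of an RDKit-based reward for generated SMILES strings.
# Only the validity/logP/QED/SA ingredients come from the paper's description;
# the weighting below is illustrative.
from rdkit import Chem
from rdkit.Chem import Crippen, QED

def molecule_reward(smiles: str) -> float:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return -1.0                  # invalid SMILES: penalize to incentivize validity
    log_p = Crippen.MolLogP(mol)     # octanol-water partition coefficient
    qed = QED.qed(mol)               # drug-likeness score in [0, 1]
    # SA score would come from RDKit's Contrib/SA_Score/sascorer.py, e.g.
    # sa = sascorer.calculateScore(mol); set to 0 here to keep the sketch self-contained.
    sa = 0.0
    return 1.0 + 0.5 * log_p + qed - 0.1 * sa   # assumed weighting of the terms
```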
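The experiment-setup row specifies the optimizer schedule and the soft target-network update (1 − η)θ⁻ + ηθ. The sketch below restates those hyperparameters and the update rule in framework-agnostic code; only the numerical values come from the quoted setup, while the names and helper functions are illustrative.

```python
# Hyperparameters quoted in the experiment setup; the names are illustrative.
BATCH_SIZE_RNN = 128        # pre-training batch size
BATCH_SIZE_TUTOR = 32       # Sequence Tutor fine-tuning batch size
LEARNING_RATE = 0.5         # initial Adam learning rate
LR_DECAY = 0.85             # stepwise decay factor
LR_DECAY_STEPS = 1000       # decay applied every 1000 steps
GRAD_CLIP_NORM = 5.0        # clip gradients to L2 norm <= 5
WEIGHT_REG_BETA = 2.5e-5    # weight regularization coefficient
GAMMA = 0.5                 # reward discount factor
ETA = 0.01                  # Target-Q-network update rate

def decayed_learning_rate(step, lr=LEARNING_RATE,
                          decay=LR_DECAY, decay_steps=LR_DECAY_STEPS):
    """Stepwise schedule: multiply the learning rate by `decay` every `decay_steps`."""
    return lr * decay ** (step // decay_steps)

def soft_update(target_weights, online_weights, eta=ETA):
    """Move target-network weights toward the online Q-network:
    theta_target <- (1 - eta) * theta_target + eta * theta_online."""
    return [(1.0 - eta) * t + eta * o
            for t, o in zip(target_weights, online_weights)]
```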