Stable Recurrent Models

Authors: John Miller, Moritz Hardt

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we conduct a thorough investigation of stable recurrent models. Theoretically, we prove stable recurrent neural networks are well approximated by feed-forward networks for the purpose of both inference and training by gradient descent. Empirically, we demonstrate stable recurrent models often perform as well as their unstable counterparts on benchmark sequence tasks.
Researcher Affiliation | Academia | John Miller & Moritz Hardt, University of California, Berkeley; {millerjohn,hardt}@berkeley.edu
Pseudocode | No | The paper describes mathematical formulations and training procedures but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The word-level language modeling code is based on https://github.com/pytorch/examples/tree/master/word_language_model, the character-level code is based on https://github.com/salesforce/awd-lstm-lm, and the polyphonic music modeling code is based on https://github.com/locuslab/TCN.
Open Datasets | Yes | For character-level language modeling, we train and evaluate models on Penn Treebank (Marcus et al., 1993). To increase the coverage of our experiments, we train and evaluate the word-level language models on the Wikitext-2 dataset, which is twice as large as Penn Treebank and features a larger vocabulary (Merity et al., 2017). We evaluate our models on JSB Chorales, a polyphonic music dataset consisting of 382 harmonized chorales by J.S. Bach (Allan & Williams, 2005). We use the Airline Travel Information Systems (ATIS) benchmark and report the F1 score for each model (Price, 1990).
Dataset Splits | No | Performance is evaluated on the held-out test set.
Hardware Specification | No | This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814 and a generous grant from the AWS Cloud Credits for Research program.
Software Dependencies | No | Although the above description is somewhat complicated, the implementation boils down to normalizing the rows of the LSTM weight matrices, which can be done very efficiently in a few lines of PyTorch. (A hedged sketch of such a row normalization follows the table below.)
Experiment Setup | Yes | Table 2: Hyperparameters for all experiments (values given as RNN / LSTM).
  Word LM: Number layers 1 / 1, Hidden units 256 / 1024, Embedding size 1024 / 512, Dropout 0.25 / 0.65, Batch size 20 / 20, Learning rate 2.0 / 20.0, BPTT 35 / 35, Gradient clipping 0.25 / 1.0, Epochs 40 / 40.
  Char LM: Number layers 1 / 1, Hidden units 768 / 1024, Embedding size 400 / 400, Dropout 0.1 / 0.1, Weight decay 1e-6 / 1e-6, Batch size 80 / 80, Learning rate 2.0 / 20.0, BPTT 150 / 150, Gradient clipping 1.0 / 1.0, Epochs 300 / 300.
  Polyphonic Music: Number layers 1 / 1, Hidden units 1024 / 1024, Dropout 0.1 / 0.1, Batch size 1 / 1, Learning rate 0.05 / 2.0, Gradient clipping 5.0 / 5.0, Epochs 100 / 100.
  Slot-Filling: Number layers 1 / 1, Hidden units 128 / 128, Embedding size 64 / 64, Dropout 0.5 / 0.5, Weight decay 1e-4 / 1e-4, Batch size 128 / 128, Learning rate 10.0 / 10.0, Gradient clipping 1.0 / 1.0, Epochs 100 / 100.
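
The Software Dependencies row quotes the paper's remark that its LSTM stability constraint "boils down to normalizing the rows of the LSTM weight matrices" in a few lines of PyTorch. The paper's code is not reproduced here, so the snippet below is only a minimal sketch of one way such a projection could be applied after each optimizer step: the function name normalize_lstm_rows_, the use of an L1 row norm, and the threshold max_row_norm are illustrative assumptions, not the authors' exact constraint.

import torch

def normalize_lstm_rows_(lstm: torch.nn.LSTM, max_row_norm: float = 1.0) -> None:
    """Sketch: rescale any row of the stacked LSTM weight matrices whose L1 norm
    exceeds max_row_norm; the norm and threshold the authors use may differ."""
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            # weight_ih_l{k} / weight_hh_l{k} stack the four gate matrices row-wise.
            if name.startswith("weight_"):
                row_norms = param.abs().sum(dim=1, keepdim=True)
                scale = torch.clamp(max_row_norm / (row_norms + 1e-12), max=1.0)
                param.mul_(scale)  # shrink only the rows that exceed the threshold

# Usage sketch: project back onto the constraint set after every gradient step,
# using the Char LM LSTM sizes from Table 2 (the loss here is arbitrary).
lstm = torch.nn.LSTM(input_size=400, hidden_size=1024, num_layers=1)
opt = torch.optim.SGD(lstm.parameters(), lr=20.0)
out, _ = lstm(torch.randn(150, 80, 400))  # BPTT 150 steps, batch size 80
out.pow(2).mean().backward()
opt.step()
normalize_lstm_rows_(lstm)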
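
The Open Source Code row notes that the word-level language modeling code is based on the pytorch/examples word_language_model repository, and Table 2 lists the corresponding hyperparameters. Purely as an illustration of how the Word LM / RNN column could be wired up, here is a minimal embedding-to-RNN-to-decoder sketch in PyTorch; the class name, module layout, and vocabulary size are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class WordLM(nn.Module):
    """Illustrative word-level language model using the Word LM / RNN column of
    Table 2; everything beyond those listed hyperparameters is assumed."""
    def __init__(self, vocab_size: int):
        super().__init__()
        self.drop = nn.Dropout(0.25)                 # dropout 0.25
        self.embed = nn.Embedding(vocab_size, 1024)  # embedding size 1024
        self.rnn = nn.RNN(1024, 256, num_layers=1)   # 1 layer, 256 hidden units
        self.decoder = nn.Linear(256, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.drop(self.embed(tokens))          # (seq_len, batch, 1024)
        out, hidden = self.rnn(emb, hidden)
        return self.decoder(self.drop(out)), hidden

model = WordLM(vocab_size=33278)                     # vocabulary size is a placeholder
opt = torch.optim.SGD(model.parameters(), lr=2.0)    # learning rate 2.0

# One illustrative training step: BPTT window of 35, batch size 20, clipping 0.25.
data = torch.randint(0, 33278, (36, 20))             # 35 inputs plus next-token targets
inputs, targets = data[:-1], data[1:]
logits, _ = model(inputs)
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)
opt.step()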