Regularizing and Optimizing LSTM Language Models
Authors: Stephen Merity, Nitish Shirish Keskar, Richard Socher
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using these and other regularization strategies, our AvSGD Weight-Dropped LSTM (AWD-LSTM) achieves state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2. We also explore the viability of the proposed regularization and optimization strategies in the context of the quasi-recurrent neural network (QRNN) and demonstrate comparable performance to the AWD-LSTM counterpart. |
| Researcher Affiliation | Industry | Stephen Merity, Nitish Shirish Keskar & Richard Socher Salesforce Research Palo Alto, CA 94301, USA {smerity,nkeskar,rsocher}@salesforce.com |
| Pseudocode | Yes | Algorithm 1 Non-monotonically Triggered AvSGD (NT-AvSGD); a hedged sketch of the trigger logic appears below the table. |
| Open Source Code | Yes | The code for reproducing the results is open sourced and is available at https://github.com/salesforce/awd-lstm-lm. |
| Open Datasets | Yes | For evaluating the impact of these approaches, we perform language modeling over a preprocessed version of the Penn Treebank (PTB) (Mikolov et al., 2010) and the WikiText-2 (WT2) data set (Merity et al., 2016). |
| Dataset Splits | Yes | Along the same lines, one could make a triggering decision based on the performance of the model on the validation set. However, instead of averaging immediately after the validation metric worsens, we propose a non-monotonic criterion that conservatively triggers the averaging when the validation metric fails to improve for multiple cycles; see Algorithm 1. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments, only mentioning compatible libraries like NVIDIA cuDNN. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python version, library versions like PyTorch or TensorFlow), which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | All experiments use a three-layer LSTM model with 1150 units in the hidden layer and an embedding of size 400. ... For training the models, we use the NT-AvSGD algorithm discussed in the previous section for 750 epochs with L equivalent to one epoch and n = 5. We use a batch size of 80 for WT2 and 40 for PTB. ... We carry out gradient clipping with maximum norm 0.25 and use an initial learning rate of 30 for all experiments. We use a random BPTT length which is N(70, 5) with probability 0.95 and N(35, 5) with probability 0.05. The values used for dropout on the word vectors, the output between LSTM layers, the output of the final LSTM layer, and embedding dropout were (0.4, 0.3, 0.4, 0.1) respectively. For the weight-dropped LSTM, a dropout of 0.5 was applied to the recurrent weight matrices. For WT2, we increase the input dropout to 0.65 to account for the increased vocabulary size. For all experiments, we use AR and TAR values of 2 and 1 respectively, and tie the embedding and softmax weights. (These settings are collected in the configuration sketch below the table.) |
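
The pseudocode row above refers to Algorithm 1, the non-monotonically triggered averaging loop (NT-AvSGD). The following is a minimal Python/PyTorch sketch of that trigger condition, not the authors' implementation (which lives in the linked repository). The `train_one_interval` and `evaluate` callables are hypothetical placeholders, and the sketch simplifies the algorithm by averaging one parameter snapshot per logging interval rather than every SGD iterate after the trigger point.

```python
import torch


def nt_avsgd(model, train_one_interval, evaluate, n=5, max_intervals=750):
    """Hedged sketch of the NT-AvSGD trigger (Algorithm 1).

    `train_one_interval(model)` is assumed to run L plain-SGD steps
    (one epoch in the paper); `evaluate(model)` is assumed to return
    the validation perplexity. Both are hypothetical helpers.
    """
    logs = []    # validation perplexities, one per logging interval
    avg = None   # running average of parameters once averaging is triggered
    count = 0

    for _ in range(max_intervals):
        train_one_interval(model)
        v = evaluate(model)

        # Non-monotonic trigger: start averaging only when the current metric
        # is worse than the best metric recorded more than n intervals ago,
        # i.e. the model has failed to improve for n consecutive checks.
        if avg is None and len(logs) > n and v > min(logs[:-n]):
            avg = {name: p.detach().clone() for name, p in model.named_parameters()}
            count = 1
        elif avg is not None:
            # Incrementally update the running average of the parameters.
            count += 1
            with torch.no_grad():
                for name, p in model.named_parameters():
                    avg[name] += (p - avg[name]) / count

        logs.append(v)

    return avg  # None if averaging was never triggered
```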
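For convenience, the hyperparameters quoted in the experiment-setup row are collected below as a plain Python dictionary, together with a small sampler for the randomized BPTT length. The key and function names are illustrative and do not correspond to the flags used in the official awd-lstm-lm repository.

```python
import numpy as np

# Hyperparameters as quoted from the paper (key names are illustrative).
AWD_LSTM_CONFIG = {
    "layers": 3,
    "hidden_size": 1150,
    "embedding_size": 400,
    "epochs": 750,
    "nonmono": 5,                       # n in NT-AvSGD; L = one epoch
    "batch_size": {"PTB": 40, "WT2": 80},
    "grad_clip": 0.25,
    "lr": 30,
    "dropout_word_vectors": 0.4,        # increased to 0.65 for WT2
    "dropout_between_layers": 0.3,
    "dropout_final_output": 0.4,
    "dropout_embedding": 0.1,
    "weight_drop": 0.5,                 # DropConnect on recurrent weights
    "ar": 2,                            # activation regularization
    "tar": 1,                           # temporal activation regularization
    "tie_weights": True,                # tie embedding and softmax weights
}


def sample_bptt_length(base=70, std=5, short_prob=0.05):
    """Random BPTT length: N(70, 5) with prob 0.95, N(35, 5) with prob 0.05.

    The floor of 5 is an illustrative guard, not specified in the paper.
    """
    mean = base / 2 if np.random.random() < short_prob else base
    return max(5, int(np.random.normal(mean, std)))
```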