Multi-timescale Representation Learning in LSTM Language Models
Authors: Shivangi Mahto, Vy Ai Vo, Javier S. Turek, Alexander Huth
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments then showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution. Further, we found that explicitly imposing the theoretical distribution upon the model during training yielded better language model perplexity overall, with particular improvements for predicting low-frequency (rare) words. Moreover, the explicit multi-timescale model selectively routes information about different types of words through units with different timescales, potentially improving model interpretability. These results demonstrate the importance of careful, theoretically-motivated analysis of memory and timescale in language models. |
| Researcher Affiliation | Collaboration | Shivangi Mahto, Department of Computer Science, The University of Texas at Austin, Austin, TX, USA (shivangi@utexas.edu); Vy A. Vo, Brain-Inspired Computing Lab, Intel Labs, Hillsboro, OR, USA (vy.vo@intel.com); Javier S. Turek, Brain-Inspired Computing Lab, Intel Labs, Hillsboro, OR, USA (javier.turek@intel.com); Alexander G. Huth, Depts. of Computer Science & Neuroscience, The University of Texas at Austin, Austin, TX, USA (huth@cs.utexas.edu). Footnote: Current affiliation: Apple Inc. |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps for a method in a code-like format. |
| Open Source Code | Yes | All models were implemented in pytorch (Paszke et al., 2019) and the code can be downloaded from https://github.com/HuthLab/multi-timescale-LSTM-LMs. |
| Open Datasets | Yes | We experimentally evaluated LSTM language models trained on the Penn Treebank (PTB) (Marcus et al., 1999; Mikolov et al., 2011) and WikiText-2 (WT2) (Merity et al., 2017) datasets. |
| Dataset Splits | Yes | PTB contains a vocabulary of 10K unique words, with 930K tokens in the training, 200K in validation, and 82K in test data. WT2 is a larger dataset with a vocabulary size of 33K unique words, almost 2M tokens in the training set, 220K in the validation set, and 240K in the test set. We randomly sampled from the grammar with probabilities p1 = 0.25, p2 = 0.25, and q = 0.25 to generate training, validation, and test sets with 10K, 2K, and 5K sequences. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, memory, or specific cloud computing instances. |
| Software Dependencies | Yes | We retrained the models using the legacy version of pytorch (0.4.1) used to train the LSTM in the original paper. |
| Experiment Setup | Yes | Both models comprise three LSTM layers with 1150 units in the first two layers and 400 units in the third layer, with an embedding size of 400. Input and output embeddings were tied. All models were trained using SGD followed by non-monotonically triggered ASGD for 1000 epochs. Training sequences were of length 70 with a probability of 0.95 and 35 with a probability of 0.05. For training, all embedding weights were uniformly initialized in the interval [-0.1, 0.1]. All weights and biases of the LSTM layers in the baseline language model were uniformly initialized between ±1/sqrt(H), where H is the output size of the respective layer. For the Dyck-2 task: Each model consists of a 256-unit LSTM layer followed by a linear output layer. Training minimized the mean squared error (MSE) loss using the Adam optimizer (Kingma & Ba, 2015) with learning rate 1e-4, β1 = 0.9, β2 = 0.999, and ε = 1e-8 for 2000 epochs. (Hedged code sketches of both setups follow this table.) |
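
To make the quoted language-model setup concrete, the sketch below wires up the described architecture in PyTorch: three LSTM layers of 1150, 1150, and 400 units over a 400-dimensional tied embedding, uniform initialization in [-0.1, 0.1] for the embeddings and ±1/sqrt(H) for each LSTM layer, and the 70/35 sequence-length sampling. The class name `TiedLSTMLM`, the plain `nn.LSTM` stack, and the omission of the SGD-then-NT-ASGD training loop and any regularization are assumptions of this sketch; the authors' released code at https://github.com/HuthLab/multi-timescale-LSTM-LMs is the authoritative implementation.

```python
import math
import torch
import torch.nn as nn


class TiedLSTMLM(nn.Module):
    """Illustrative three-layer LSTM LM with tied input/output embeddings."""

    def __init__(self, vocab_size, emb_size=400, hidden_sizes=(1150, 1150, 400)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        sizes = (emb_size,) + tuple(hidden_sizes)
        self.lstms = nn.ModuleList(
            [nn.LSTM(sizes[i], sizes[i + 1]) for i in range(len(hidden_sizes))]
        )
        # Tying requires the last hidden size to equal the embedding size (400).
        self.decoder = nn.Linear(emb_size, vocab_size)
        self.decoder.weight = self.embedding.weight
        self.reset_parameters()

    def reset_parameters(self):
        # Embedding weights uniform in [-0.1, 0.1]; LSTM weights and biases
        # uniform in [-1/sqrt(H), 1/sqrt(H)], H = output size of that layer.
        nn.init.uniform_(self.embedding.weight, -0.1, 0.1)
        for lstm in self.lstms:
            bound = 1.0 / math.sqrt(lstm.hidden_size)
            for param in lstm.parameters():
                nn.init.uniform_(param, -bound, bound)

    def forward(self, tokens):  # tokens: (seq_len, batch) of word indices
        x = self.embedding(tokens)
        for lstm in self.lstms:
            x, _ = lstm(x)
        return self.decoder(x)  # (seq_len, batch, vocab_size) logits


def sample_bptt_length():
    # Training sequences of length 70 with probability 0.95, else length 35.
    return 70 if torch.rand(()).item() < 0.95 else 35
```

For PTB one would instantiate, e.g., `TiedLSTMLM(vocab_size=10000)` to match the 10K-word vocabulary; the SGD followed by non-monotonically triggered ASGD schedule (1000 epochs) quoted above is deliberately left out of the sketch.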
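
The Dyck-2 probe described in the same row can be sketched similarly: a single 256-unit LSTM followed by a linear readout, trained with MSE loss and Adam (lr=1e-4, β1=0.9, β2=0.999, ε=1e-8) for 2000 epochs. Anything beyond those quoted details, in particular the input/output dimensionality and the `batches` iterable, is a placeholder assumption.

```python
import torch
import torch.nn as nn


class Dyck2Model(nn.Module):
    """Illustrative 256-unit LSTM with a linear output layer for the Dyck-2 task."""

    def __init__(self, input_size, output_size, hidden_size=256):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size)
        self.readout = nn.Linear(hidden_size, output_size)

    def forward(self, x):  # x: (seq_len, batch, input_size)
        h, _ = self.lstm(x)
        return self.readout(h)


def train_dyck2(model, batches, epochs=2000):
    # MSE loss with Adam, hyperparameters as quoted in the setup row.
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8
    )
    for _ in range(epochs):
        for inputs, targets in batches:  # assumed iterable of (input, target) tensors
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
```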