Multiplicative LSTM for sequence modelling
Authors: Ben Krause, Iain Murray, Steve Renals, Liang Lu
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate empirically that mLSTM outperforms standard LSTM and its deep variants for a range of character level modelling tasks, and that this improvement increases with the complexity of the task. This model achieves a test error of 1.19 bits/character on the last 4 million characters of the Hutter prize dataset when combined with dynamic evaluation. |
| Researcher Affiliation | Academia | Ben Krause, Iain Murray & Steve Renals School of Informatics, University of Edinburgh Edinburgh, Scotland, UK {ben.krause,i.murray,s.renals}@ed.ac.uk Liang Lu Toyota Technological Institute at Chicago Chicago, Illinois, USA {llu}@ttic.edu |
| Pseudocode | No | The paper contains mathematical equations describing the model, but no structured pseudocode or algorithm blocks are present. |
| Open Source Code | Yes | Code to replicate our large scale experiments on the Hutter prize dataset is available at https://github.com/benkrause/mLSTM. |
| Open Datasets | Yes | We used the Penn Treebank dataset (Marcus et al., 1993) to test small scale language modelling, the processed and raw versions of the Wikipedia text8 dataset (Hutter, 2012) to test large scale language modelling and byte level language modelling respectively, and the European parliament dataset (Koehn, 2005) to investigate multilingual fitting. |
| Dataset Splits | Yes | The first 90 million characters were used for training, the next 5 million for validation, and the final 5 million for testing. Each dataset was split 90-5-5 for training, validation, and testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models, memory, or specific computing environments. |
| Software Dependencies | No | The paper mentions using 'a variant of RMSprop' but does not specify any software libraries, frameworks, or programming languages with their respective version numbers. |
| Experiment Setup | Yes | Gradient computation in these experiments used truncated backpropagation through time on sequences of length 100, only resetting the hidden state every 10,000 timesteps to allow networks access to information far in the past. All experiments used a variant of RMSprop (Tieleman & Hinton, 2012), with normalized updates in place of a learning rate. We fitted an mLSTM with 700 hidden units to the Penn Treebank dataset, with no regularization other than early stopping. We trained an mLSTM with hidden dimensionality of 1900 on the text8 dataset. All experiments were run for 4 epochs. |
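The headline result quoted in the Research Type row is reported in bits per character. As a small reference sketch (not from the paper), the conversion from a model's summed negative log-likelihood in nats, the usual cross-entropy training loss, to bits/character is just a division by ln 2; the function name is illustrative.

```python
import math


def bits_per_character(total_nll_nats: float, num_chars: int) -> float:
    """Convert a summed negative log-likelihood measured in nats
    (the usual cross-entropy loss) into bits per character."""
    return total_nll_nats / (num_chars * math.log(2))


# An average loss of about 0.825 nats/char corresponds to ~1.19 bits/char,
# the figure quoted above for the Hutter prize test set.
print(bits_per_character(0.825, 1))
```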
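The Dataset Splits row quotes a 90M/5M/5M character split of text8. A minimal sketch of that split, assuming the raw 100M-character text8 file is available locally under that name:

```python
# Assumed file name; text8 is a single 100M-character ASCII file.
with open("text8", encoding="utf-8") as f:
    data = f.read()

train_text = data[:90_000_000]            # first 90M characters: training
valid_text = data[90_000_000:95_000_000]  # next 5M characters: validation
test_text = data[95_000_000:]             # final 5M characters: testing
```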
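The Experiment Setup row describes mLSTMs with 700 (Penn Treebank) and 1900 (text8) hidden units, but the report reproduces no code. Below is a minimal sketch of the mLSTM cell the paper describes, in which the LSTM's recurrent input h_{t-1} is replaced by an intermediate state m_t = (W_mx x_t) ⊙ (W_mh h_{t-1}); the PyTorch layer layout, bias placement, and the tanh on the candidate are illustrative assumptions, not the authors' released implementation (see their repository linked above for that).

```python
import torch
import torch.nn as nn


class MultiplicativeLSTMCell(nn.Module):
    """Sketch of an mLSTM cell: the LSTM's recurrent input h_{t-1} is
    replaced by an intermediate state m_t = (W_mx x_t) * (W_mh h_{t-1}),
    so the recurrent transition depends on the current input."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Projections forming the multiplicative intermediate state m_t.
        self.wx_m = nn.Linear(input_size, hidden_size, bias=False)
        self.wh_m = nn.Linear(hidden_size, hidden_size, bias=False)
        # Joint projections for candidate, input, forget and output gates.
        self.wx = nn.Linear(input_size, 4 * hidden_size)
        self.wm = nn.Linear(hidden_size, 4 * hidden_size, bias=False)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        m_t = self.wx_m(x_t) * self.wh_m(h_prev)  # multiplicative intermediate state
        cand, i, f, o = (self.wx(x_t) + self.wm(m_t)).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_t = f * c_prev + i * torch.tanh(cand)   # standard LSTM memory update
        h_t = torch.tanh(c_t) * o
        return h_t, (h_t, c_t)
```

For the runs quoted above, such a cell would be instantiated with `hidden_size=700` for Penn Treebank or `hidden_size=1900` for text8, with embedded or one-hot characters as `x_t`.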