Multiplicative LSTM for sequence modelling

Authors: Ben Krause, Iain Murray, Steve Renals, Liang Lu

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that mLSTM outperforms standard LSTM and its deep variants for a range of character level modelling tasks, and that this improvement increases with the complexity of the task. This model achieves a test error of 1.19 bits/character on the last 4 million characters of the Hutter prize dataset when combined with dynamic evaluation. (An mLSTM cell sketch appears after this table.)
Researcher Affiliation | Academia | Ben Krause, Iain Murray & Steve Renals, School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK ({ben.krause,i.murray,s.renals}@ed.ac.uk); Liang Lu, Toyota Technological Institute at Chicago, Chicago, Illinois, USA ({llu}@ttic.edu)
Pseudocode | No | The paper contains mathematical equations describing the model, but no structured pseudocode or algorithm blocks are present.
Open Source Code | Yes | Code to replicate our large scale experiments on the Hutter prize dataset is available at https://github.com/benkrause/mLSTM.
Open Datasets | Yes | We used the Penn Treebank dataset (Marcus et al., 1993) to test small scale language modelling, the processed and raw versions of the Wikipedia text8 dataset (Hutter, 2012) to test large scale language modelling and byte level language modelling respectively, and the European parliament dataset (Koehn, 2005) to investigate multilingual fitting.
Dataset Splits | Yes | The first 90 million characters were used for training, the next 5 million for validation, and the final 5 million for testing; the 100 million character text8 corpus was thus split 90-5-5 for training, validation, and testing. (See the split sketch after this table.)
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models, memory, or specific computing environments.
Software Dependencies | No | The paper mentions using 'a variant of RMSprop' but does not specify any software libraries, frameworks, or programming languages with their respective version numbers.
Experiment Setup | Yes | Gradient computation in these experiments used truncated backpropagation through time on sequences of length 100, only resetting the hidden state every 10,000 timesteps to allow networks access to information far in the past. All experiments used a variant of RMSprop (Tieleman & Hinton, 2012), with normalized updates in place of a learning rate. We fitted an mLSTM with 700 hidden units to the Penn Treebank dataset, with no regularization other than early stopping. We trained an mLSTM with hidden dimensionality of 1900 on the text8 dataset. All experiments were run for 4 epochs. (A training-schedule sketch follows this table.)
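
To make the model being assessed concrete, here is a minimal NumPy sketch of a single mLSTM timestep. It assumes the formulation described in the paper, in which an intermediate multiplicative state m_t = (W_mx x_t) ⊙ (W_mh h_{t-1}) replaces h_{t-1} in the gates of an otherwise standard LSTM; the exact gate parameterization, the omission of biases, and the shapes and initialization below are illustrative assumptions, not the authors' released Torch code.

```python
# Minimal sketch of one mLSTM step (assumptions noted above; not the authors' code).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x, h_prev, c_prev, p):
    """One mLSTM timestep. `p` is a dict of weight matrices (biases omitted for brevity)."""
    m = (p["W_mx"] @ x) * (p["W_mh"] @ h_prev)   # multiplicative intermediate state
    h_hat = p["W_hx"] @ x + p["W_hm"] @ m        # candidate update
    i = sigmoid(p["W_ix"] @ x + p["W_im"] @ m)   # input gate
    f = sigmoid(p["W_fx"] @ x + p["W_fm"] @ m)   # forget gate
    o = sigmoid(p["W_ox"] @ x + p["W_om"] @ m)   # output gate
    c = f * c_prev + i * np.tanh(h_hat)          # cell state
    h = np.tanh(c) * o                           # hidden state
    return h, c

# Example shapes: a 700-unit hidden state as in the Penn Treebank setting; the
# 50-dimensional character input is an assumption for illustration only.
n_in, n_h = 50, 700
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.01, size=(n_h, n_in if k.endswith("x") else n_h))
     for k in ["W_mx", "W_mh", "W_hx", "W_hm", "W_ix", "W_im", "W_fx", "W_fm", "W_ox", "W_om"]}
h, c = mlstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), p)
```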
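The 90M/5M/5M split quoted in the Dataset Splits row can be reproduced in a few lines. The file name and byte-level slicing here are assumptions; the paper only states the character counts for the 100 million character text8 corpus.

```python
# Hypothetical reproduction of the text8 split described above: first 90M characters
# for training, next 5M for validation, final 5M for testing.
with open("text8", "rb") as f:   # the "text8" file name is an assumption
    data = f.read()

train = data[:90_000_000]
valid = data[90_000_000:95_000_000]
test = data[95_000_000:]
print(len(train), len(valid), len(test))  # 90,000,000 / 5,000,000 / 5,000,000
```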
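The Experiment Setup row describes the training schedule only in prose. The PyTorch-style loop below sketches one plausible reading of it: truncated backpropagation through time over length-100 chunks, carrying the hidden state across chunks and resetting it every 10,000 timesteps, and an RMSprop-style update whose step is normalized rather than scaled by a conventional learning rate. The model interface, decay constant, step size, and the particular normalization (dividing the preconditioned gradient by its global norm) are assumptions; the paper does not specify these details.

```python
# A rough sketch of the training schedule under the assumptions stated above.
import torch

def train(model, params, stream, seq_len=100, reset_every=10_000,
          step_size=1e-3, decay=0.95, eps=1e-8):
    """Truncated BPTT with periodic hidden-state resets and normalized RMSprop-style updates."""
    ms = [torch.zeros_like(p) for p in params]       # second-moment accumulators
    hidden, steps_since_reset = None, 0
    for start in range(0, len(stream) - seq_len - 1, seq_len):
        if steps_since_reset >= reset_every:
            hidden, steps_since_reset = None, 0      # reset the hidden state every 10,000 steps
        x = stream[start:start + seq_len]            # input chunk of length 100
        y = stream[start + 1:start + seq_len + 1]    # next-character targets
        loss, hidden = model(x, y, hidden)           # assumed model interface
        hidden = tuple(h.detach() for h in hidden)   # truncate backprop at the chunk boundary
        for p in params:
            p.grad = None
        loss.backward()
        with torch.no_grad():
            precond = []
            for p, m in zip(params, ms):
                m.mul_(decay).addcmul_(p.grad, p.grad, value=1 - decay)
                precond.append(p.grad / (m.sqrt() + eps))
            total_norm = torch.sqrt(sum((g ** 2).sum() for g in precond)).item()
            for p, g in zip(params, precond):
                p.add_(g, alpha=-step_size / (total_norm + eps))  # normalized update
        steps_since_reset += seq_len
```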