Grid Long Short-Term Memory
Authors: Nal Kalchbrenner, Ivo Danihelka, Alex Graves
ICLR 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply the model to algorithmic tasks such as 15-digit integer addition and sequence memorization, where it is able to significantly outperform the standard LSTM. We then give results for two empirical tasks. We find that 2D Grid LSTM achieves 1.47 bits per character on the Wikipedia character prediction benchmark, which is state-of-the-art among neural approaches. In addition, we use the Grid LSTM to define a novel two-dimensional translation model, the Reencoder, and show that it outperforms a phrase-based reference system on a Chinese-to-English translation task. |
| Researcher Affiliation | Industry | "Nal Kalchbrenner & Ivo Danihelka & Alex Graves, Google DeepMind, London, United Kingdom, {nalk,danihelka,gravesa}@google.com" |
| Pseudocode | No | The paper describes the architecture and computations using mathematical equations and descriptive text, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing the source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | "We next test the 2-LSTM network on the Hutter challenge Wikipedia dataset (Hutter, 2012). The aim is to successively predict the next character in the corpus. The dataset has 100 million characters." and "We evaluate the Grid LSTM translation model on the IWSLT BTEC Chinese-to-English corpus that consists of 44016 pairs of source and target sentences for training, 1006 for development and 503 for testing." and "The MNIST dataset consists of 50000 training images, 10000 validation images and 10000 test images." |
| Dataset Splits | Yes | "We train the networks for up to 5 million samples or until they reach 100% accuracy on a random sample of 100 unseen addition problems. Note that since during training all samples are randomly generated, samples are seen only once and it is not possible for the network to overfit on training data. The training and test accuracies agree closely." (Addition) and "We follow the splitting procedure of (Chung et al., 2015), where the last 5 million characters are used for testing." (Wikipedia) and "...the IWSLT BTEC Chinese-to-English corpus that consists of 44016 pairs of source and target sentences for training, 1006 for development and 503 for testing." (Translation) and "The MNIST dataset consists of 50000 training images, 10000 validation images and 10000 test images." (MNIST) See the split sketch after the table. |
| Hardware Specification | Yes | Each network is trained with a maximum of 10 million samples or four days of computation on a Tesla K40m GPU. |
| Software Dependencies | No | The paper mentions optimization algorithms like 'Adam' and 'AdaGrad' but does not specify version numbers for any software, libraries, or programming languages used in the experiments. |
| Experiment Setup | Yes | "We train the two types of networks with either tied or untied weights, with 400 hidden units each and with between 1 and 50 layers. We train the network with stochastic gradient descent using mini-batches of size 15 and the Adam optimizer with a learning rate of 0.001 (Kingma & Ba, 2014)." (Addition) and "We use a tied 2-LSTM with 1000 hidden units and 6 layers of depth." (Character-level LM) and "We train seven models with vectors of size 450 and apply dropout with probability 0.5 to the hidden vectors within the blocks. For the optimization we use Adam with a learning rate of 0.001. At decoding the output probabilities are averaged across the models. The beam search has size 20..." (Translation) and "The 3-LSTM with the cells uses patches of 2x2 pixels, has four LSTM layers with 100 hidden units and one ReLU layer with 4096 units... We use mini-batches of size 128 and train the models using Adam and a learning rate of 0.001." (MNIST) These settings are collected in the configuration sketch after the table. |
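The dataset splits quoted in the table are simple enough to restate as code. The Python sketch below is only an illustration under stated assumptions, not material from the paper: the `enwik8` file name, the byte-level read, and the choice of which 10000 MNIST training images become the validation set are my own; only the split sizes come from the excerpts.

```python
# Minimal sketch of the data splits described in the excerpts above.
# File paths and array handling are assumptions, not taken from the paper.

def split_wikipedia(path="enwik8"):  # hypothetical file name
    """Hold out the last 5 million characters for testing, following the
    splitting procedure of Chung et al. (2015) quoted in the table."""
    with open(path, "rb") as f:
        data = f.read()
    test_size = 5_000_000
    return data[:-test_size], data[-test_size:]


def split_mnist(train_images, train_labels, test_images, test_labels):
    """50000 training / 10000 validation / 10000 test images, per the
    excerpt above; carving the validation set from the end of the standard
    60000-image training set is an assumption."""
    train = (train_images[:50_000], train_labels[:50_000])
    valid = (train_images[50_000:], train_labels[50_000:])
    test = (test_images, test_labels)
    return train, valid, test
```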
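Likewise, the hyperparameters scattered across the Experiment Setup row can be gathered in one place. The Python dictionaries below are a reading aid, not an implementation: the key names are invented for illustration, and the values are limited to what the excerpts state.

```python
# Hyperparameters collected from the Experiment Setup excerpts above.
# Dictionary keys are illustrative names, not identifiers from the paper.

ADDITION = {
    "hidden_units": 400,
    "layers": range(1, 51),        # networks trained with 1 to 50 layers
    "weights": ("tied", "untied"),
    "batch_size": 15,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}

CHAR_LM_WIKIPEDIA = {
    "model": "tied 2-LSTM",
    "hidden_units": 1000,
    "layers": 6,
}

TRANSLATION = {
    "ensemble_size": 7,            # output probabilities averaged across models
    "vector_size": 450,
    "dropout": 0.5,                # applied to hidden vectors within the blocks
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "beam_size": 20,
}

MNIST = {
    "model": "3-LSTM",
    "patch_size": (2, 2),
    "lstm_layers": 4,
    "lstm_hidden_units": 100,
    "relu_units": 4096,
    "batch_size": 128,
    "optimizer": "Adam",
    "learning_rate": 0.001,
}
```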