Parallelizing Legendre Memory Unit Training
Authors: Narsimha Reddy Chilkuri, Chris Eliasmith
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the improved accuracy of our new architecture compared to the original LMU and a variety of published LSTM and transformer networks across seven benchmarks. For instance, our LMU sets a new state-of-the-art result on psMNIST, and uses half the parameters while outperforming DistilBERT and LSTM models on IMDB sentiment analysis. In the following experiments we compare our model against the LMU, LSTMs and transformers. With these experiments, we focus on benchmarking rather than establishing new state-of-the-art results. |
| Researcher Affiliation | Collaboration | Narsimha Chilkuri¹, Chris Eliasmith¹,²; ¹Center for Theoretical Neuroscience, University of Waterloo; ²Applied Brain Research. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described, nor does it explicitly state that the code is being released or is available. |
| Open Datasets | Yes | The paper uses several publicly available datasets: 'psMNIST dataset', 'Mackey-Glass equations', 'IMDB dataset (Maas et al., 2011)', 'Quora Question Pairs (QQP)', 'Stanford Natural Language Inference (SNLI)', 'Amazon Reviews dataset (Ni et al., 2019)', and 'text8 dataset'. |
| Dataset Splits | Yes | For psMNIST: 'We use the standard 50k/10k/10k split.' For QQP: 'experiment on two train/dev/test splits: 390k/8k/8k like in Shen et al. (2018), and 280k/80k/40k like in Sharma et al. (2019).' For SNLI: 'We use the standard 550k/10k/10k split'. For text8: 'we use the first 90MB as the training set, the next 5MB as the validation set and the final 5MB as the test set.' For IWSLT 15 En-Vi: 'We use the TED tst2012 as the validation set and TED tst2013 as the test set.' A minimal split sketch for psMNIST and text8 is given after the table. |
| Hardware Specification | Yes | Figure 1 caption: 'All results were measured using a single GTX 1080.' Section 4.3 Semi-Supervised: 'train our model for about 12 hours on a single GPU.' |
| Software Dependencies | No | The paper mentions 'Adam optimizer (Kingma & Ba, 2014)' and 'Keras website' but does not specify any software libraries or frameworks with their version numbers (e.g., Python, TensorFlow, PyTorch, CUDA versions). |
| Experiment Setup | Yes | The paper states 'we stick to simple architectures, constrain ourselves to training all the models, with the exception of text8, using the Adam optimizer (Kingma & Ba, 2014) with all the default parameter settings.' For Mackey-Glass: 'All the models contain about 18k parameters and are run for 500 epochs.' For IMDB, QQP, SNLI: 'We use 300D Glove embeddings (840B Common Crawl; Pennington et al. (2014)) for all our models.' For text8: 'we found it helpful to reduce the learning rate by a factor of 10 halfway into training.' Sketches of the GloVe embedding setup and the Adam/learning-rate schedule follow the table. |
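The splits quoted above translate directly into a few lines of preprocessing. Below is a minimal sketch for the psMNIST 50k/10k/10k split and the text8 90MB/5MB/5MB split, assuming TensorFlow/Keras and NumPy are available; the permutation seed and the `text8` file path are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import tensorflow as tf

# psMNIST: flatten each 28x28 image into a length-784 sequence and apply one
# fixed random permutation to every image. The paper quotes a 50k/10k/10k split.
(x_train_full, y_train_full), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train_full = x_train_full.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

perm = np.random.RandomState(0).permutation(784)  # seed is illustrative, not from the paper
x_train_full, x_test = x_train_full[:, perm], x_test[:, perm]

# 50k train / 10k validation out of the 60k MNIST training images;
# the 10k test set is the standard MNIST test split.
x_train, y_train = x_train_full[:50_000], y_train_full[:50_000]
x_val, y_val = x_train_full[50_000:], y_train_full[50_000:]

# text8: first 90 MB for training, next 5 MB for validation, final 5 MB for test.
with open("text8", "rb") as f:  # file path is an assumption
    data = f.read()
MB = 10**6
train_text, val_text = data[:90 * MB], data[90 * MB:95 * MB]
test_text = data[95 * MB:]
```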
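The 300D GloVe embeddings mentioned in the experiment setup can be wired into a Keras model roughly as follows. This is a hedged sketch rather than the authors' code: the `glove.840B.300d.txt` path and the `word_index` mapping are assumptions, and whether the embeddings are kept frozen is not stated in the excerpts.

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 300  # 300D GloVe vectors (840B Common Crawl)

def load_glove(path="glove.840B.300d.txt"):  # path is an assumption
    """Parse GloVe vectors into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return vectors

def build_embedding_layer(word_index, glove):
    """Build an Embedding layer initialised with GloVe vectors for a given vocabulary."""
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM), dtype="float32")
    for word, idx in word_index.items():
        vec = glove.get(word)
        if vec is not None:
            matrix[idx] = vec
    return tf.keras.layers.Embedding(
        input_dim=matrix.shape[0],
        output_dim=EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=False,  # freezing the embeddings is an assumption, not stated in the excerpts
    )
```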
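Likewise, the reported optimizer choice (Adam with all default parameter settings) and the text8 schedule (learning rate reduced by a factor of 10 halfway into training) can be reproduced with a standard Keras callback. The model below is a stand-in, not the paper's parallelized LMU, and the epoch count and data are placeholders.

```python
import numpy as np
import tensorflow as tf

EPOCHS = 10  # placeholder; the excerpts only state 500 epochs for Mackey-Glass

def lr_schedule(epoch, lr):
    """text8 only: reduce the learning rate by a factor of 10 halfway into training."""
    return lr / 10.0 if epoch == EPOCHS // 2 else lr

# Stand-in model and data so the sketch runs end to end.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(),  # Adam with all default parameter settings, per the paper
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

x = np.random.rand(256, 784).astype("float32")
y = np.random.randint(0, 10, size=256)
model.fit(
    x, y,
    epochs=EPOCHS,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)],
)
```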