Reversible Recurrent Neural Networks

Authors: Matthew MacKay, Paul Vicol, Jimmy Ba, Roger B. Grosse

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of these models on language modeling and neural machine translation benchmarks. Depending on the task, dataset, and chosen architecture, reversible models (without attention) achieve 10-15-fold memory savings over traditional models. Reversible models achieve approximately equivalent performance to traditional LSTM and GRU models on word-level language modeling on the Penn TreeBank dataset [Marcus et al., 1993] and lag 2-5 perplexity points behind traditional models on the WikiText-2 dataset [Merity et al., 2016]. Achieving comparable memory savings with attention-based recurrent sequence-to-sequence models is difficult, since the encoder hidden states must be kept simultaneously in memory in order to perform attention. We address this challenge by performing attention over a small subset of the hidden state, concatenated with the word embedding. With this technique, our reversible models succeed on neural machine translation tasks, outperforming baseline GRU and LSTM models on the Multi30K dataset [Elliott et al., 2016] and achieving competitive performance on the IWSLT 2016 [Cettolo et al., 2016] benchmark. Applying our technique reduces memory cost by a factor of 10-15 in the decoder, and a factor of 5-10 in the encoder. (See the attention sketch after this table.)
Researcher Affiliation | Academia | Matthew MacKay, Paul Vicol, Jimmy Ba, Roger Grosse; University of Toronto, Vector Institute; {mmackay, pvicol, jba, rgrosse}@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: Exactly reversible multiplication (Maclaurin et al. [2015]). (A Python sketch of this algorithm follows the table.)
Open Source Code | Yes | Code will be made available at https://github.com/matthewjmackay/reversible-rnn (paper footnote 1).
Open Datasets | Yes | We evaluate the performance of these models on language modeling and neural machine translation benchmarks. We evaluated our one- and two-layer reversible models on word-level language modeling on the Penn Treebank [Marcus et al., 1993] and WikiText-2 [Merity et al., 2016] corpora. We further evaluated our models on English-to-German neural machine translation (NMT)... Multi30K [Elliott et al., 2016], a dataset of 30,000 sentence pairs derived from Flickr image captions, and IWSLT 2016 [Cettolo et al., 2016], a larger dataset of 180,000 pairs.
Dataset Splits | Yes | Table 1: Validation perplexities (memory savings) on Penn TreeBank word-level language modeling. Table 2: Validation perplexities on WikiText-2 word-level language modeling. We include training/validation curves for all models in Appendix I.
Hardware Specification | No | The paper mentions running on a "parallel architecture such as a GPU" but does not provide specific hardware details such as GPU model numbers, CPU models, or memory capacities used for the experiments.
Software Dependencies | No | The paper mentions frameworks such as TensorFlow and Theano in the related work, but it does not provide specific version numbers for the software dependencies used in its own implementation.
Experiment Setup | Yes | We regularized the hidden-to-hidden, hidden-to-output, and input-to-hidden connections, as well as the embedding matrix, using various forms of dropout. We used the hyperparameters from Merity et al. [2017]. Details are provided in Appendix G.1. Experimental details are provided in Appendix G.2. (A dropout placement sketch follows this table.)
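
The Research Type row quotes the paper's key attention trick: the decoder attends over a small slice of each encoder hidden state concatenated with the source word embedding, so only that slice (plus the embeddings) must stay in memory while the rest of the encoder state is recomputed by reversal. The sketch below only illustrates the idea; the tensor shapes, slice size, and dot-product scoring are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def sliced_attention(enc_hiddens, enc_embeds, dec_query, slice_size=64):
        # enc_hiddens: (src_len, batch, hidden_dim)     full encoder states
        # enc_embeds:  (src_len, batch, embed_dim)      source word embeddings
        # dec_query:   (batch, slice_size + embed_dim)  decoder query vector
        # Keys are built from a small slice of each hidden state plus the word
        # embedding, so only that slice needs to be stored to perform attention.
        keys = torch.cat([enc_hiddens[:, :, :slice_size], enc_embeds], dim=-1)
        scores = torch.einsum('sbd,bd->sb', keys, dec_query)  # dot-product scores
        weights = F.softmax(scores, dim=0)                    # normalize over source positions
        context = torch.einsum('sb,sbd->bd', weights, keys)   # weighted sum of reduced keys
        return context, weights

    # Toy usage with made-up sizes
    h = torch.randn(10, 4, 256)   # encoder hidden states
    e = torch.randn(10, 4, 300)   # source embeddings
    q = torch.randn(4, 64 + 300)  # decoder query
    context, weights = sliced_attention(h, e, q, slice_size=64)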
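
The Pseudocode row points to Algorithm 1, exactly reversible multiplication from Maclaurin et al. [2015]: an integer-encoded value is multiplied by a rational factor n/d while the bits that integer division would discard are pushed onto an integer information buffer, so the step can later be undone exactly. Below is a minimal Python sketch of that buffer scheme; the function names and the round-trip test are illustrative, not the authors' implementation.

    def reversible_mul(h, buf, n, d):
        # Multiply integer-encoded h by n/d; remainder bits are stored in buf.
        buf = buf * d + h % d  # push the remainder that division would discard
        h = h // d
        h = h * n
        h = h + buf % n        # pull bits back out to fill the low-order digits
        buf = buf // n
        return h, buf

    def reversible_mul_inverse(h, buf, n, d):
        # Exactly undo reversible_mul(h, buf, n, d).
        buf = buf * n + h % n
        h = h // n
        h = h * d
        h = h + buf % d
        buf = buf // d
        return h, buf

    h0, buf0 = 1000003, 1                                   # arbitrary positive state and buffer
    h1, buf1 = reversible_mul(h0, buf0, n=7, d=8)           # multiply by 7/8
    h2, buf2 = reversible_mul_inverse(h1, buf1, n=7, d=8)   # recover exactly
    assert (h2, buf2) == (h0, buf0)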
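
The Experiment Setup row lists where dropout is applied: the input-to-hidden, hidden-to-hidden, and hidden-to-output connections, plus the embedding matrix, with hyperparameters taken from Merity et al. [2017]. The sketch below only illustrates that placement in PyTorch; the layer sizes and dropout rates are assumed values, and the hidden-to-hidden (DropConnect-style) mask on the recurrent weights is noted in a comment rather than implemented.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DropoutLM(nn.Module):
        # Illustrative dropout placement; sizes and rates are assumptions.
        def __init__(self, vocab_size, embed_dim=400, hidden_dim=650,
                     p_embed=0.1, p_input=0.4, p_output=0.4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.drop_input = nn.Dropout(p_input)      # input-to-hidden connections
            # A DropConnect-style mask on the recurrent (hidden-to-hidden)
            # weights would be applied here as well; omitted for brevity.
            self.rnn = nn.LSTM(embed_dim, hidden_dim)
            self.drop_output = nn.Dropout(p_output)    # hidden-to-output connections
            self.decoder = nn.Linear(hidden_dim, vocab_size)
            self.p_embed = p_embed

        def forward(self, tokens, hidden=None):
            weight = self.embed.weight
            if self.training and self.p_embed > 0:
                # Embedding dropout: zero out whole rows (words) of the embedding matrix.
                keep = torch.rand(weight.size(0), 1, device=weight.device) >= self.p_embed
                weight = weight * keep / (1 - self.p_embed)
            x = F.embedding(tokens, weight)
            x = self.drop_input(x)
            out, hidden = self.rnn(x, hidden)
            out = self.drop_output(out)
            return self.decoder(out), hidden

    # Toy usage: token indices of shape (seq_len, batch)
    model = DropoutLM(vocab_size=10000)
    logits, state = model(torch.randint(0, 10000, (35, 20)))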