Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling

Authors: Chaitanya Ahuja, Louis-Philippe Morency

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this family of new LRU models on computational convergence rates and statistical efficiency. Our experiments are performed on four publicly-available datasets, comparing with Grid-LSTM and Recurrent Highway networks. Our results show that LRU has better empirical computational convergence rates and statistical efficiency values, along with learning more accurate language models.
Researcher Affiliation | Academia | Chaitanya Ahuja, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, cahuja@andrew.cmu.edu; Louis-Philippe Morency, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, morency@cs.cmu.edu
Pseudocode | No | The paper provides mathematical formulations of the models but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code available at: https://github.com/chahuja/lru
Open Datasets | Yes | We use Penn Treebank Dataset (henceforth PTB) (Taylor, Marcus, and Santorini 2003) with pre-processing in (Mikolov et al. 2010) and the War and Peace Dataset (henceforth WP) as the standard benchmarks for character-level language modeling. Among bigger datasets, we use enwik8 and text8 from the Hutter Prize dataset (Hutter 2012).
Dataset Splits | Yes | Following common practice, we chose first 90% for training, next 5% for validation and last 5% for testing for all datasets.
Hardware Specification | No | The paper mentions using "PyTorch implementations of LSTM and GRU which are highly optimized with a C backend" but does not specify any particular hardware (e.g., GPU model, CPU, memory).
Software Dependencies | No | The paper mentions using "PyTorch implementations" but does not provide version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | To make the comparison fair, we fixed number of parameters to 10M and 24M... All models are either 2 or 4 layers deep, except RHNs which were trained with the transition depth of 5... Batch size was fixed to 250 and all the models are trained by backpropagating the error up till 50 time steps. We use the optimizer Adam (Kingma and Ba 2015) with an exponentially (factor of 0.9) decaying learning rate of 0.001, β1 = 0.1 and β2 = 0.001. All weights were initialized using Glorot initialization (Glorot and Bengio 2010). We optimize our models with Categorical Cross Entropy (CCE) as the loss function...
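The contiguous 90/5/5 split quoted in the Dataset Splits row above is straightforward to reproduce. The sketch below is a minimal illustration, assuming the corpus is loaded as a single character string; the file path and function name are hypothetical, not taken from the paper or its repository.

```python
# Minimal sketch of the contiguous 90% / 5% / 5% split described above.
# The path and helper name are hypothetical; the paper only states the
# proportions and that the splits are taken in document order.

def split_corpus(text, train_frac=0.90, val_frac=0.05):
    """Return contiguous train / validation / test slices of a character corpus."""
    n = len(text)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return text[:train_end], text[train_end:val_end], text[val_end:]

with open("data/ptb_char.txt", encoding="utf-8") as f:  # hypothetical path
    corpus = f.read()

train_text, val_text, test_text = split_corpus(corpus)
```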
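The Experiment Setup row above pins down the optimizer, initialization, and loss. A minimal PyTorch sketch of that configuration follows; the nn.LSTM is only a stand-in for the LRU / Grid-LSTM / RHN models, the β values are copied verbatim from the paper, and the granularity of the exponential learning-rate decay (per epoch vs. per step) is an assumption, since the paper does not state it.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

# Placeholder model: the paper compares LRU, Grid-LSTM and RHN variants at
# 10M / 24M parameters; nn.LSTM is used here only to make the sketch runnable.
model = nn.LSTM(input_size=128, hidden_size=512, num_layers=2)

# Glorot (Xavier) initialization for all weight matrices, as stated in the paper.
for name, param in model.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)
    else:
        nn.init.zeros_(param)

# Adam with learning rate 0.001 and the betas reported in the paper
# (beta1 = 0.1, beta2 = 0.001 -- unusual values, quoted verbatim).
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.1, 0.001))

# Exponential learning-rate decay with factor 0.9 (decay granularity assumed).
scheduler = ExponentialLR(optimizer, gamma=0.9)

# Categorical cross entropy over the character vocabulary.
criterion = nn.CrossEntropyLoss()
```

A full run would wrap this in a truncated-BPTT loop with sequences of 50 time steps and batches of 250, stepping the scheduler on the assumed schedule.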