Lattice Recurrent Unit: Improving Convergence and Statistical Efficiency for Sequence Modeling

Authors: Chaitanya Ahuja, Louis-Philippe Morency

AAAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this family of new LRU models on computational convergence rates and statistical efficiency. Our experiments are performed on four publicly-available datasets, comparing with Grid-LSTM and Recurrent Highway networks. Our results show that LRU has better empirical computational convergence rates and statistical efficiency values, along with learning more accurate language models.
Researcher Affiliation | Academia | Chaitanya Ahuja, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, cahuja@andrew.cmu.edu; Louis-Philippe Morency, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, morency@cs.cmu.edu
Pseudocode | No | The paper provides mathematical formulations of the models but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Source code available at: https://github.com/chahuja/lru
Open Datasets | Yes | We use Penn Treebank Dataset (henceforth PTB) (Taylor, Marcus, and Santorini 2003) with pre-processing in (Mikolov et al. 2010) and the War and Peace Dataset (henceforth WP) as the standard benchmarks for character-level language modeling. Among bigger datasets, we use enwik8 and text8 from the Hutter Prize dataset (Hutter 2012).
Dataset Splits | Yes | Following common practice, we chose first 90% for training, next 5% for validation and last 5% for testing for all datasets.
Hardware Specification | No | The paper mentions using "PyTorch implementations of LSTM and GRU which are highly optimized with a C backend" but does not specify any particular hardware (e.g., GPU model, CPU, memory).
Software Dependencies | No | The paper mentions using "PyTorch implementations" but does not provide version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | To make the comparison fair, we fixed number of parameters to 10M and 24M... All models are either 2 or 4 layers deep, except RHNs which were trained with the transition depth of 5... Batch size was fixed to 250 and all the models are trained by backpropagating the error up till 50 time steps. We use the optimizer Adam (Kingma and Ba 2015) with an exponentially (factor of 0.9) decaying learning rate of 0.001, β1 = 0.1 and β2 = 0.001. All weights were initialized using Glorot initialization (Glorot and Bengio 2010). We optimize our models with Categorical Cross Entropy (CCE) as the loss function...
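The contiguous 90/5/5 split quoted in the Dataset Splits row above is straightforward to reproduce. The sketch below is a minimal illustration, assuming the corpus is loaded as a single character string; the file path and function name are hypothetical, not taken from the paper or its repository.

```python
# Minimal sketch of the contiguous 90% / 5% / 5% split described above.
# The path and helper name are hypothetical; the paper only states the
# proportions and that the splits are taken in document order.

def split_corpus(text, train_frac=0.90, val_frac=0.05):
    """Return contiguous train / validation / test slices of a character corpus."""
    n = len(text)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return text[:train_end], text[train_end:val_end], text[val_end:]

with open("data/ptb_char.txt", encoding="utf-8") as f:  # hypothetical path
    corpus = f.read()

train_text, val_text, test_text = split_corpus(corpus)
```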
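The Experiment Setup row above pins down the optimizer, initialization, and loss. A minimal PyTorch sketch of that configuration follows; the nn.LSTM is only a stand-in for the LRU / Grid-LSTM / RHN models, the β values are copied verbatim from the paper, and the granularity of the exponential learning-rate decay (per epoch vs. per step) is an assumption, since the paper does not state it.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

# Placeholder model: the paper compares LRU, Grid-LSTM and RHN variants at
# 10M / 24M parameters; nn.LSTM is used here only to make the sketch runnable.
model = nn.LSTM(input_size=128, hidden_size=512, num_layers=2)

# Glorot (Xavier) initialization for all weight matrices, as stated in the paper.
for name, param in model.named_parameters():
    if "weight" in name:
        nn.init.xavier_uniform_(param)
    else:
        nn.init.zeros_(param)

# Adam with learning rate 0.001 and the betas reported in the paper
# (beta1 = 0.1, beta2 = 0.001 -- unusual values, quoted verbatim).
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.1, 0.001))

# Exponential learning-rate decay with factor 0.9 (decay granularity assumed).
scheduler = ExponentialLR(optimizer, gamma=0.9)

# Categorical cross entropy over the character vocabulary.
criterion = nn.CrossEntropyLoss()
```

A full run would wrap this in a truncated-BPTT loop with sequences of 50 time steps and batches of 250, stepping the scheduler on the assumed schedule.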