Language Modeling with Recurrent Highway Hypernetworks

Authors: Joseph Suarez

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present extensive experimental and theoretical support for the efficacy of recurrent highway networks (RHNs) and recurrent hypernetworks complementary to the original works. Where the original RHN work primarily provides theoretical treatment of the subject, we demonstrate experimentally that RHNs benefit from far better gradient flow than LSTMs in addition to their improved task accuracy. The original hypernetworks work presents detailed experimental results but leaves several theoretical issues unresolved; we consider these in depth and frame several feasible solutions that we believe will yield further gains in the future. We demonstrate that these approaches are complementary: by combining RHNs and hypernetworks, we make a significant improvement over current state-of-the-art character-level language modeling performance on Penn Treebank while relying on much simpler regularization. Finally, we argue for RHNs as a drop-in replacement for LSTMs (analogous to LSTMs for vanilla RNNs) and for hypernetworks as a de-facto augmentation (analogous to attention) for recurrent architectures. (A hedged sketch of the combined RHN-hypernetwork cell follows the table.)
Researcher Affiliation | Academia | Joseph Suarez, Stanford University, joseph15@stanford.edu
Pseudocode | No | The paper presents mathematical equations for recurrent cells but does not include a block explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | open sourcing (code, footnote 3) a combined architecture that obtains SOTA on PTB... Footnote 3: github.com/jsuarez5341/Recurrent-Highway-Hypernetworks-NIPS
Open Datasets | Yes | Penn Treebank (PTB) contains approximately 5.10M/0.40M/0.45M characters in the train/val/test sets respectively... [16] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
Dataset Splits | Yes | Penn Treebank (PTB) contains approximately 5.10M/0.40M/0.45M characters in the train/val/test sets respectively
Hardware Specification | Yes | on a single GTX 1080 Ti
Software Dependencies | No | The paper mentions using the 'Adam' optimizer but does not specify any software library versions (e.g., PyTorch 1.x, TensorFlow 2.x) required for replication.
Experiment Setup | Yes | We train all models using Adam [20] with the default learning rate 0.001 and sequence length 100, batch size 256... Both subnetworks use a recurrent dropout keep probability of 0.65 and no other regularizer/normalizer.
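
For readers checking the architecture claim in the Research Type row, below is a minimal PyTorch sketch of a recurrent highway cell modulated by a small hypernetwork. It assumes coupled carry gates (c = 1 - t) as in the original RHN paper and elementwise scaling of the main cell's pre-activations as in the hypernetworks paper; the class names RHNCell and HyperRHN, the single projection for the scale vector, and all dimensions are illustrative assumptions, not the author's released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class RHNCell(nn.Module):
    """Minimal recurrent highway cell with coupled carry gate c = 1 - t.
    An optional per-unit scale vector (from a hypernetwork) modulates the pre-activations."""
    def __init__(self, input_dim, hidden_dim, depth):
        super().__init__()
        self.depth = depth
        # The input feeds only the first micro-layer of the recurrence depth.
        self.W = nn.Linear(input_dim, 2 * hidden_dim)
        # One recurrent transform per micro-layer (H and T streams stacked).
        self.R = nn.ModuleList([nn.Linear(hidden_dim, 2 * hidden_dim) for _ in range(depth)])

    def forward(self, x, s, scale=None):
        for l in range(self.depth):
            pre = self.R[l](s)
            if l == 0:
                pre = pre + self.W(x)
            if scale is not None:           # hypernetwork modulation, applied elementwise
                pre = pre * scale
            h, t = pre.chunk(2, dim=-1)
            h, t = torch.tanh(h), torch.sigmoid(t)
            s = h * t + s * (1.0 - t)       # highway update with coupled gates
        return s

class HyperRHN(nn.Module):
    """A small RHN emits a scale vector that modulates the main RHN at every timestep."""
    def __init__(self, input_dim, hidden_dim, hyper_dim, depth):
        super().__init__()
        self.hyper = RHNCell(input_dim, hyper_dim, depth)
        self.main = RHNCell(input_dim, hidden_dim, depth)
        self.project = nn.Linear(hyper_dim, 2 * hidden_dim)  # scales both the H and T streams

    def forward(self, x, state):
        s_hyper, s_main = state
        s_hyper = self.hyper(x, s_hyper)
        scale = self.project(s_hyper)
        s_main = self.main(x, s_main, scale=scale)
        return s_main, (s_hyper, s_main)
```

The coupled gate keeps the state update a convex combination of the transform and carry paths, and emitting per-unit scale vectors rather than full weight matrices keeps the hypernetwork's parameter count small, which is the design choice the hypernetworks line of work motivates.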
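Likewise, the Experiment Setup row translates into a short training-loop sketch. The sequence length, batch size, learning rate, and recurrent keep probability are the values quoted above; the vocabulary and layer sizes, the embedding, the variational-style dropout mask, and the train_step helper are assumptions made only to keep the example self-contained (it reuses the HyperRHN sketch from the previous block).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Values quoted in the Experiment Setup row; model sizes below are illustrative only.
VOCAB, EMBED, HIDDEN, HYPER, DEPTH = 50, 64, 512, 128, 3
SEQ_LEN, BATCH_SIZE, LR, KEEP_PROB = 100, 256, 1e-3, 0.65

embed = nn.Embedding(VOCAB, EMBED)
cell = HyperRHN(EMBED, HIDDEN, HYPER, DEPTH)        # from the sketch above
head = nn.Linear(HIDDEN, VOCAB)
params = list(embed.parameters()) + list(cell.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=LR)         # Adam with the default 0.001 learning rate

def train_step(chars):
    """chars: LongTensor of shape (batch, SEQ_LEN + 1) holding character indices."""
    s_hyper = torch.zeros(chars.size(0), HYPER)
    s_main = torch.zeros(chars.size(0), HIDDEN)
    # One dropout mask per sequence on the recurrent state (an assumed variational-style
    # scheme) with the quoted keep probability of 0.65.
    mask = torch.bernoulli(torch.full_like(s_main, KEEP_PROB)) / KEEP_PROB
    loss = 0.0
    for t in range(SEQ_LEN):
        x = embed(chars[:, t])
        _, (s_hyper, s_main) = cell(x, (s_hyper, s_main * mask))
        loss = loss + F.cross_entropy(head(s_main), chars[:, t + 1])
    optimizer.zero_grad()
    (loss / SEQ_LEN).backward()
    optimizer.step()
    return loss.item() / SEQ_LEN

# Example usage with random character indices:
# train_step(torch.randint(0, VOCAB, (BATCH_SIZE, SEQ_LEN + 1)))
```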