Trellis Networks for Sequence Modeling

Authors: Shaojie Bai, J. Zico Kolter, Vladlen Koltun

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments demonstrate that trellis networks outperform current state-of-the-art methods on a variety of challenging benchmarks, including word-level and character-level language modeling tasks and stress tests designed to evaluate long-term memory retention.
Researcher Affiliation | Collaboration | Shaojie Bai (Carnegie Mellon University); J. Zico Kolter (Carnegie Mellon University and Bosch Center for AI); Vladlen Koltun (Intel Labs)
Pseudocode | No | The paper describes the Trellis Network architecture and its computations using mathematical equations and descriptive text, but it does not include any formal pseudocode or algorithm blocks. (A simplified sketch of the layer update is given after the table.)
Open Source Code | Yes | The code is available at https://github.com/locuslab/trellisnet
Open Datasets | Yes | We evaluate trellis networks on challenging benchmarks, including word-level language modeling on the standard Penn Treebank (PTB) and the much larger WikiText-103 (WT103) datasets; character-level language modeling on Penn Treebank; and standard stress tests (e.g. sequential MNIST, permuted MNIST, etc.)... The original Penn Treebank (PTB) dataset... (Marcus et al., 1993)... WikiText-103 (WT103)... (Merity et al., 2017)... The MNIST handwritten digits dataset (LeCun et al., 1989)... The CIFAR-10 dataset (Krizhevsky & Hinton, 2009)... (A sketch of the sequential/permuted MNIST construction follows the table.)
Dataset Splits | Yes | The PTB dataset contains 888K words for training, 70K for validation, and 79K for testing... WT103... with 103M words for training, 218K words for validation, and 246K words for testing/evaluation.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | Table 5 specifies the trellis networks used for the various tasks. There are a few things to note while reading the table. First, in training, we decay the learning rate once the validation error plateaus for a while (or according to some fixed schedule, such as after 100 epochs). Second, for the auxiliary loss (see Appendix B for more details), we insert the loss function after every fixed number of layers in the network; this frequency is included under the Auxiliary Frequency entry. Finally, the hidden dropout in the table refers to the variational dropout translated from RNNs (see Appendix B), which is applied at all hidden layers of the TrellisNet. (A sketch of these training mechanics follows the table.)
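
As the Pseudocode row notes, the paper specifies the architecture only through equations and prose. The sketch below is a minimal, simplified rendering of the weight-tied TrellisNet update described there: a kernel-size-2 causal convolution over the concatenation of the injected input and the previous level's hidden units, followed by an LSTM-style gated activation, with the same weights reused at every level. The module name, tensor layout, and exact gate ordering are assumptions on our part; the reference implementation at https://github.com/locuslab/trellisnet is authoritative.

```python
# Minimal sketch of a weight-tied TrellisNet-style layer (an assumption-laden
# paraphrase of the paper's equations, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleTrellisLayer(nn.Module):
    """One weight-tied transformation, reused at every level of the trellis."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Kernel-size-2 causal convolution over [injected input; previous-level
        # hidden state], producing four LSTM-style gate pre-activations.
        self.conv = nn.Conv1d(input_size + hidden_size, 4 * hidden_size, kernel_size=2)

    def forward(self, x, z, c):
        # x: (batch, input_size, T)  raw input, re-injected at every level
        # z: (batch, hidden_size, T) hidden output of the previous level
        # c: (batch, hidden_size, T) cell state carried alongside z
        inp = torch.cat([x, z], dim=1)
        inp = F.pad(inp, (1, 0))            # left-pad so the convolution stays causal
        gates = self.conv(inp)
        i, o, f, g = gates.chunk(4, dim=1)  # gate ordering here is an assumption
        c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        z_new = torch.sigmoid(o) * torch.tanh(c_new)
        return z_new, c_new


def trellis_forward(layer, x, num_levels):
    """Unroll the trellis: apply the SAME layer (same weights) num_levels times."""
    batch, _, seq_len = x.shape
    z = x.new_zeros(batch, layer.hidden_size, seq_len)
    c = x.new_zeros(batch, layer.hidden_size, seq_len)
    for _ in range(num_levels):
        z, c = layer(x, z, c)
    return z
```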
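The sequential and permuted MNIST stress tests referenced in the Open Datasets row are standard constructions: each 28x28 digit is flattened into a length-784 pixel sequence, and the permuted variant applies one fixed shuffle of the pixel positions to every image. The sketch below shows one way to build them, assuming torchvision is available; the data path, seed, and helper name are illustrative and not taken from the paper.

```python
# Hedged sketch of the sequential / permuted MNIST stress-test setup.
import torch
from torchvision import datasets, transforms


def load_sequential_mnist(root="./data", permuted=False, seed=0):
    """Return MNIST as length-784 pixel sequences, plus an optional fixed permutation."""
    to_sequence = transforms.Compose([
        transforms.ToTensor(),                           # (1, 28, 28) in [0, 1]
        transforms.Lambda(lambda img: img.view(-1, 1)),  # (784, 1) pixel sequence
    ])
    train = datasets.MNIST(root, train=True, download=True, transform=to_sequence)
    test = datasets.MNIST(root, train=False, download=True, transform=to_sequence)

    permutation = None
    if permuted:
        # One fixed random shuffle of the 784 pixel positions, shared by all
        # images; apply it as seq[permutation] in the training loop.
        permutation = torch.randperm(784, generator=torch.Generator().manual_seed(seed))
    return train, test, permutation
```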
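The Experiment Setup row quotes three training details: learning-rate decay when the validation error plateaus, an auxiliary loss inserted after every fixed number of layers, and variational ("hidden") dropout applied at all hidden layers. The sketch below illustrates those mechanics only; the hyperparameter values, the use of ReduceLROnPlateau, and the hypothetical model interface (layer, decoder, hidden_size, num_levels) are assumptions, not the authors' training script.

```python
# Hedged sketch of the quoted training details, under assumed hyperparameters
# and a hypothetical TrellisNet-like `model` interface.
import torch
import torch.nn as nn


def make_optimizer_and_scheduler(model, lr=20.0, factor=0.5, patience=5):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    # Decay the learning rate once the validation metric stops improving;
    # call scheduler.step(val_loss) after every epoch.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=factor, patience=patience)
    return optimizer, scheduler


def training_loss(model, x, targets, aux_frequency=2, aux_weight=0.3, hidden_dropout=0.2):
    criterion = nn.CrossEntropyLoss()
    # Variational ("hidden") dropout: sample ONE mask per sequence and reuse it
    # at every hidden level, rather than resampling per level.
    keep = 1.0 - hidden_dropout
    mask = torch.bernoulli(
        torch.full((x.size(0), model.hidden_size, 1), keep, device=x.device)) / keep

    loss = 0.0
    z = torch.zeros(x.size(0), model.hidden_size, x.size(-1), device=x.device)
    for level in range(model.num_levels):
        z = model.layer(x, z) * mask        # same dropout mask at all hidden levels
        # Insert an auxiliary loss after every `aux_frequency` levels (except the last).
        if (level + 1) % aux_frequency == 0 and level + 1 < model.num_levels:
            loss = loss + aux_weight * criterion(model.decoder(z), targets)
    return loss + criterion(model.decoder(z), targets)
```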