Deep Residual Output Layers for Neural Language Generation

Authors: Nikolaos Pappas, James Henderson

Venue: ICML 2019

Reproducibility assessment (variable, result, and supporting LLM response from the paper):
Research Type: Experimental. "Evaluations on three language generation tasks show that our output label mapping can match or improve state-of-the-art recurrent and self-attention architectures, and suggest that the classifier does not necessarily need to be high-rank to better model natural language if it is better at capturing the structure of the output space."
Researcher Affiliation: Academia. "Idiap Research Institute, Martigny, Switzerland. Correspondence to: Nikolaos Pappas <nikolaos.pappas@idiap.ch>."
Pseudocode: No. No pseudocode or algorithm blocks are present in the paper.
Open Source Code: Yes. "Our code and settings are available at http://github.com/idiap/drill."
Open Datasets: Yes. "Following previous work in language modeling (Yang et al., 2018; Krause et al., 2018; Merity et al., 2017; Melis et al., 2017), we evaluate the proposed model in terms of perplexity on two widely used language modeling datasets, namely Penn Treebank (Mikolov et al., 2010) and WikiText-2 (Merity et al., 2017), which have vocabularies of 10,000 and 33,278 words, respectively."
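Perplexity, the metric quoted above, is the exponential of the average per-token cross-entropy. A minimal PyTorch sketch, for reference only (the helper below is illustrative and not from the paper's codebase):

    import torch
    import torch.nn.functional as F

    def perplexity(logits, targets):
        # logits: (num_tokens, vocab_size) unnormalized scores
        # targets: (num_tokens,) gold token ids
        # perplexity = exp(mean per-token cross-entropy)
        nll = F.cross_entropy(logits, targets, reduction="mean")
        return torch.exp(nll).item()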
Dataset Splits: Yes. "Our hyper-parameters were optimized based on validation perplexity, as follows: 4-layer label encoder depth, 400-dimensional label embeddings, 0.6 dropout rate, residual connection to E, uniform weight initialization in the interval [-0.1, 0.1], for both datasets, and, furthermore, sigmoid activation and variational dropout for Penn Treebank, as well as relu activation and standard dropout for WikiText-2." For NMT: "using the Newstest2013 set for validation and the Newstest2014 set for testing."
Hardware Specification: No. The paper does not specify the hardware (e.g., CPU or GPU models) used to run the experiments.
Software Dependencies: No. "For the implementation of the AWD-LSTM we used the language modeling toolkit in Pytorch provided by Merity et al. (2017), and for the dynamic evaluation the code in Pytorch provided by Krause et al. (2018)." (The paper names software such as PyTorch and OpenNMT but gives no specific version numbers.)
Experiment Setup: Yes. "Our hyper-parameters were optimized based on validation perplexity, as follows: 4-layer label encoder depth, 400-dimensional label embeddings, 0.6 dropout rate, residual connection to E, uniform weight initialization in the interval [-0.1, 0.1], for both datasets, and, furthermore, sigmoid activation and variational dropout for Penn Treebank, as well as relu activation and standard dropout for WikiText-2." For NMT: "2-layer label encoder depth, 512-dimensional label embeddings, 0.0 dropout rate, sigmoid activation function, residual connection to E, and uniform weight initialization in [-0.1, 0.1]."
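To make the quoted configuration concrete, here is a minimal PyTorch sketch of a deep residual output layer under the language-modeling settings above (4-layer label encoder, 400-dimensional label embeddings, sigmoid activation, 0.6 dropout, residual connection to the label embedding matrix E, uniform initialization in [-0.1, 0.1]). It is an illustrative reconstruction under these assumptions, not the authors' implementation; see http://github.com/idiap/drill for the actual code.

    import torch
    import torch.nn as nn

    class DeepResidualOutput(nn.Module):
        # Scores each context vector against label embeddings E that have been
        # passed through a small residual MLP ("label encoder"), rather than
        # against E directly.
        def __init__(self, vocab_size=10000, emb_dim=400, depth=4, dropout=0.6):
            super().__init__()
            self.E = nn.Embedding(vocab_size, emb_dim)    # label embeddings E
            nn.init.uniform_(self.E.weight, -0.1, 0.1)    # init in [-0.1, 0.1]
            self.layers = nn.ModuleList(
                [nn.Linear(emb_dim, emb_dim) for _ in range(depth)])
            for layer in self.layers:
                nn.init.uniform_(layer.weight, -0.1, 0.1)
            self.drop = nn.Dropout(dropout)

        def forward(self, hidden):
            # hidden: (batch, emb_dim) context vectors from the sentence encoder
            u = self.E.weight                             # (vocab, emb_dim)
            for layer in self.layers:
                u = self.drop(torch.sigmoid(layer(u)))    # deep label encoder
            u = u + self.E.weight                         # residual connection to E
            return hidden @ u.t()                         # (batch, vocab) logits

    # Example: 32 context vectors scored over the 10,000-word PTB vocabulary.
    logits = DeepResidualOutput()(torch.randn(32, 400))   # shape (32, 10000)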