Character-Aware Neural Language Models

Authors: Yoon Kim, Yacine Jernite, David Sontag, Alexander Rush

AAAI 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. We conduct hyperparameter search, model introspection, and ablation studies on the English Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), utilizing the standard training (0-20), validation (21-22), and test (23-24) splits along with pre-processing by Mikolov et al. (2010).
Researcher Affiliation | Academia | Yoon Kim, School of Engineering and Applied Sciences, Harvard University (yoonkim@seas.harvard.edu); Yacine Jernite, Courant Institute of Mathematical Sciences, New York University (jernite@cs.nyu.edu); David Sontag, Courant Institute of Mathematical Sciences, New York University (dsontag@cs.nyu.edu); Alexander M. Rush, School of Engineering and Applied Sciences, Harvard University (srush@seas.harvard.edu)
Pseudocode | No | The paper describes the model architecture and equations but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We have released all the code for the models described in this paper. (https://github.com/yoonkim/lstm-char-cnn)
Open Datasets | Yes | We conduct hyperparameter search, model introspection, and ablation studies on the English Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), utilizing the standard training (0-20), validation (21-22), and test (23-24) splits along with pre-processing by Mikolov et al. (2010)... this version has been extensively used by the language modeling community and is publicly available (http://www.fit.vutbr.cz/~imikolov/rnnlm/). The non-Arabic data comes from the 2013 ACL Workshop on Machine Translation (http://www.statmt.org/wmt13/translation-task.html), and we use the same train/validation/test splits as in Botha and Blunsom (2014). The Arabic data comes from the News-Commentary corpus (http://opus.lingfil.uu.se/News-Commentary.php), and we perform our own preprocessing and train/validation/test splits.
Dataset Splits | Yes | We conduct hyperparameter search, model introspection, and ablation studies on the English Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), utilizing the standard training (0-20), validation (21-22), and test (23-24) splits along with pre-processing by Mikolov et al. (2010). (A split-loading sketch follows the table.)
Hardware Specification | No | Only the general statement 'All models were trained on GPUs with 2GB memory' is provided; specific GPU models, CPU models, or other machine details are not mentioned.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | The models are trained by truncated backpropagation through time (Werbos 1990; Graves 2013). We backpropagate for 35 time steps using stochastic gradient descent where the learning rate is initially set to 1.0 and halved if the perplexity does not decrease by more than 1.0 on the validation set after an epoch. On DATA-S we use a batch size of 20 and on DATA-L we use a batch size of 100 (for greater efficiency). Gradients are averaged over each batch. We train for 25 epochs on non-Arabic and 30 epochs on Arabic data (which was sufficient for convergence), picking the best performing model on the validation set. Parameters of the model are randomly initialized over a uniform distribution with support [-0.05, 0.05]. For regularization we use dropout (Hinton et al. 2012) with probability 0.5 on the LSTM input-to-hidden layers (except on the initial Highway to LSTM layer) and the hidden-to-output softmax layer. We further constrain the norm of the gradients to be below 5, so that if the L2 norm of the gradient exceeds 5 then we renormalize it to have ||.|| = 5 before updating.
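
The Dataset Splits row quotes the standard PTB section splits (training 0-20, validation 21-22, test 23-24) with the Mikolov et al. (2010) pre-processing. The sketch below is only a rough illustration of loading those splits once the pre-processed files have been obtained; the directory layout, the file names, and the load_split helper are assumptions for illustration and are not part of the paper or the authors' released Torch code.

import os

# Hypothetical loader for the Mikolov-preprocessed PTB files. The file names
# below are the ones commonly distributed with that pre-processing and are
# assumed to correspond to the WSJ section splits quoted above.
SPLIT_FILES = {
    "train": "ptb.train.txt",   # sections 0-20
    "valid": "ptb.valid.txt",   # sections 21-22
    "test": "ptb.test.txt",     # sections 23-24
}

def load_split(data_dir, split):
    """Return one split as a flat list of tokens, appending an
    end-of-sentence marker after every line."""
    path = os.path.join(data_dir, SPLIT_FILES[split])
    tokens = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens.extend(line.split() + ["<eos>"])
    return tokens

if __name__ == "__main__":
    # "data/ptb" is an assumed local directory containing the three files.
    splits = {name: load_split("data/ptb", name) for name in SPLIT_FILES}
    for name, toks in splits.items():
        print(name, len(toks), "tokens")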
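
The Experiment Setup row lists the optimization details: truncated BPTT over 35 steps, SGD with an initial learning rate of 1.0 halved when validation perplexity fails to improve by more than 1.0, batch sizes of 20 (DATA-S) or 100 (DATA-L), uniform initialization in [-0.05, 0.05], dropout of 0.5, and gradient-norm clipping at 5. The authors' released implementation is in Torch/Lua; the sketch below re-expresses those settings in PyTorch purely for illustration, using a plain word-level LSTM as a stand-in for the character-aware model, so the module names, the 10,000-word vocabulary, and the plateau-halving logic are assumptions rather than the authors' code.

import torch
import torch.nn as nn

# Settings quoted in the Experiment Setup row (DATA-S values in comments).
SEQ_LEN = 35           # truncated backpropagation-through-time length
BATCH_SIZE = 20        # 100 on DATA-L
INIT_LR = 1.0
MAX_GRAD_NORM = 5.0    # renormalize gradients whose L2 norm exceeds 5
DROPOUT = 0.5
INIT_RANGE = 0.05      # parameters initialized uniformly in [-0.05, 0.05]
VOCAB_SIZE = 10000     # assumed PTB vocabulary size

class TinyLM(nn.Module):
    """Stand-in word-level LSTM LM; the paper's model instead feeds a
    character CNN plus highway layers into the LSTM."""
    def __init__(self, vocab, hidden=650):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.drop = nn.Dropout(DROPOUT)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2,
                            dropout=DROPOUT, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab)

    def forward(self, x, hidden=None):
        out, hidden = self.lstm(self.embed(x), hidden)
        return self.decoder(self.drop(out)), hidden

model = TinyLM(VOCAB_SIZE)
for p in model.parameters():          # uniform init over [-0.05, 0.05]
    nn.init.uniform_(p, -INIT_RANGE, INIT_RANGE)

optimizer = torch.optim.SGD(model.parameters(), lr=INIT_LR)
criterion = nn.CrossEntropyLoss()

def train_step(x, y, hidden):
    optimizer.zero_grad()
    logits, hidden = model(x, hidden)
    loss = criterion(logits.reshape(-1, VOCAB_SIZE), y.reshape(-1))
    loss.backward()
    # Clip so the global gradient L2 norm is at most 5 before updating.
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    optimizer.step()
    # Detach the hidden state so BPTT is truncated at SEQ_LEN steps.
    return loss.item(), tuple(h.detach() for h in hidden)

def maybe_halve_lr(val_ppl, prev_val_ppl):
    # Halve the learning rate when validation perplexity fails to
    # improve by more than 1.0 over the previous epoch.
    if prev_val_ppl is not None and prev_val_ppl - val_ppl <= 1.0:
        for group in optimizer.param_groups:
            group["lr"] /= 2.0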