Character-Aware Neural Language Models
Authors: Yoon Kim, Yacine Jernite, David Sontag, Alexander Rush
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. We conduct hyperparameter search, model introspection, and ablation studies on the English Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), utilizing the standard training (0-20), validation (21-22), and test (23-24) splits along with pre-processing by Mikolov et al. (2010). |
| Researcher Affiliation | Academia | Yoon Kim, School of Engineering and Applied Sciences, Harvard University, yoonkim@seas.harvard.edu; Yacine Jernite, Courant Institute of Mathematical Sciences, New York University, jernite@cs.nyu.edu; David Sontag, Courant Institute of Mathematical Sciences, New York University, dsontag@cs.nyu.edu; Alexander M. Rush, School of Engineering and Applied Sciences, Harvard University, srush@seas.harvard.edu |
| Pseudocode | No | The paper describes the model architecture and equations but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have released all the code for the models described in this paper. (https://github.com/yoonkim/lstm-char-cnn) |
| Open Datasets | Yes | We conduct hyperparameter search, model introspection, and ablation studies on the English Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), utilizing the standard training (0-20), validation (21-22), and test (23-24) splits along with pre-processing by Mikolov et al. (2010)... this version has been extensively used by the language modeling community and is publicly available (http://www.fit.vutbr.cz/~imikolov/rnnlm/). The non-Arabic data comes from the 2013 ACL Workshop on Machine Translation (http://www.statmt.org/wmt13/translation-task.html), and we use the same train/validation/test splits as in Botha and Blunsom (2014). The Arabic data comes from the News-Commentary corpus (http://opus.lingfil.uu.se/News-Commentary.php), and we perform our own preprocessing and train/validation/test splits. |
| Dataset Splits | Yes | We conduct hyperparameter search, model introspection, and ablation studies on the English Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993), utilizing the standard training (0-20), validation (21-22), and test (23-24) splits along with pre-processing by Mikolov et al. (2010). |
| Hardware Specification | No | Only a general statement is provided: 'All models were trained on GPUs with 2GB memory.' Specific GPU models, CPU models, or detailed machine specifications are not mentioned. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | The models are trained by truncated backpropagation through time (Werbos 1990; Graves 2013). We backpropagate for 35 time steps using stochastic gradient descent where the learning rate is initially set to 1.0 and halved if the perplexity does not decrease by more than 1.0 on the validation set after an epoch. On DATA-S we use a batch size of 20 and on DATA-L we use a batch size of 100 (for greater efficiency). Gradients are averaged over each batch. We train for 25 epochs on non-Arabic and 30 epochs on Arabic data (which was sufficient for convergence), picking the best performing model on the validation set. Parameters of the model are randomly initialized over a uniform distribution with support [-0.05, 0.05]. For regularization we use dropout (Hinton et al. 2012) with probability 0.5 on the LSTM input-to-hidden layers (except on the initial Highway to LSTM layer) and the hidden-to-output softmax layer. We further constrain the norm of the gradients to be below 5, so that if the L2 norm of the gradient exceeds 5 then we renormalize it to have ||∇|| = 5 before updating. |
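
The experiment-setup excerpt above maps onto a fairly standard truncated-BPTT training loop. The sketch below is a minimal illustration of those optimization settings only, not the authors' released Torch implementation (https://github.com/yoonkim/lstm-char-cnn): a toy two-layer LSTM language model trained on random placeholder batches stands in for the character CNN + highway + LSTM architecture and the actual PTB data, so `ToyLM`, `VOCAB`, and the random batches are assumptions made here for illustration.

```python
# Hedged sketch of the optimization recipe quoted above; NOT the authors' code.
# ToyLM and the random batches are placeholders; only the optimizer settings
# (SGD, lr 1.0 with halving, 35-step BPTT, clip to norm 5, dropout 0.5,
# uniform init over [-0.05, 0.05]) follow the paper's description.
import math
import torch
import torch.nn as nn

SEQ_LEN, BATCH_SIZE, VOCAB = 35, 20, 100     # 35-step truncated BPTT; batch size 20 (DATA-S)
INIT_LR, GRAD_CLIP, N_EPOCHS = 1.0, 5.0, 25  # lr 1.0, clip to norm 5, 25 epochs (non-Arabic)

class ToyLM(nn.Module):
    """Stand-in for the character CNN + highway + LSTM model."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, dropout=0.5, batch_first=True)
        self.drop = nn.Dropout(0.5)          # dropout before the hidden-to-output softmax
        self.out = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(self.drop(h))

model = ToyLM(VOCAB)
for p in model.parameters():                 # uniform initialization over [-0.05, 0.05]
    nn.init.uniform_(p, -0.05, 0.05)

optimizer = torch.optim.SGD(model.parameters(), lr=INIT_LR)
criterion = nn.CrossEntropyLoss()

def run_epoch(train):
    """One pass over random stand-in batches; returns perplexity."""
    model.train(train)
    total, n_batches = 0.0, 10
    for _ in range(n_batches):
        x = torch.randint(VOCAB, (BATCH_SIZE, SEQ_LEN))   # placeholder for real PTB batches
        y = torch.randint(VOCAB, (BATCH_SIZE, SEQ_LEN))
        loss = criterion(model(x).reshape(-1, VOCAB), y.reshape(-1))
        if train:
            optimizer.zero_grad()
            loss.backward()                               # gradients averaged over the batch
            nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)  # renormalize to norm 5
            optimizer.step()
        total += loss.item()
    return math.exp(total / n_batches)

prev_ppl = float("inf")
for epoch in range(N_EPOCHS):
    run_epoch(train=True)
    val_ppl = run_epoch(train=False)
    if prev_ppl - val_ppl <= 1.0:            # halve lr if validation ppl did not drop by > 1.0
        for g in optimizer.param_groups:
            g["lr"] /= 2.0
    prev_ppl = val_ppl
```

Adjusting the learning rate by editing `optimizer.param_groups` keeps the update rule as plain SGD, which is all the excerpt describes; no momentum or adaptive optimizer is mentioned in the paper.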