Improving Language Modelling with Noise Contrastive Estimation

Authors: Farhana Ferdousi Liza, Marek Grzes

Venue: AAAI 2018

Each entry below gives a reproducibility variable, the assessed result, and the supporting LLM response (quoted from the paper where applicable).
Research Type: Experimental. LLM Response: "Using a popular benchmark, we showed that appropriate tuning of NCE in neural language models outperforms the state-of-the-art single-model methods based on standard dropout and the standard LSTM recurrent neural networks. Sections 4 and 5 describe the experimental design and the results showing that the proposed method improves the state-of-the-art results on the Penn Tree Bank dataset using language modelling based on a standard LSTM (Hochreiter and Schmidhuber 1997; Gers 2001)."
Researcher Affiliation: Academia. LLM Response: "Farhana Ferdousi Liza, Marek Grzes, School of Computing, University of Kent, Canterbury, CT2 7NF, UK, {fl207, m.grzes}@kent.ac.uk"
Pseudocode: No. LLM Response: The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code: No. LLM Response: The paper does not provide concrete access to source code for the methodology described. There is no mention of a code release or links to repositories.
Open Datasets: Yes. LLM Response: "To demonstrate that, we used the Penn Tree Bank (PTB) dataset (Marcus, Marcinkiewicz, and Santorini 1993), which is a popular language modelling benchmark with a vocabulary size of 10k words." The dataset footnote points to http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz.
Dataset Splits: Yes. LLM Response: "The PTB dataset consists of 929k training words, 73k validation words, and 82k test words."
Hardware Specification: Yes. LLM Response: "All models were implemented in Tensorflow and executed on NVIDIA K80 GPUs."
Software Dependencies: No. LLM Response: The paper mentions Tensorflow but does not specify a version number, nor does it list any other software components with specific versions.
Experiment Setup: Yes. LLM Response: "All the models have two LSTM layers with the hidden layer size of 200 (S), 650 (M), and 1500 (L). The LSTM was unrolled for 20 time steps for the small model and 35 time steps for the medium and large models. We used mini-batch SGD for training where the mini-batch size was 20. The learning rate was scheduled using Eq. 9. The search time limit τ was chosen empirically using Fig. 1. As a result, τ was set to 7, 25 and 12 for the small, medium and large models correspondingly. During the convergence period, the parameter ψ was set to 2, 1.2 and 1.15 for the small, medium and large models as suggested by (Zaremba, Sutskever, and Vinyals 2014). We trained the models for 20, 39 and 55 epochs respectively. The norm of the gradients (which was normalised by the mini-batch size) was clipped at 5 and 10 for the medium and large models correspondingly. In NCE, we used 600 noise samples." A minimal configuration sketch based on these reported settings follows this table.
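
Since the paper releases no code, the sketch below is only an illustration of how the reported medium (M) configuration could be wired up in TensorFlow: a two-layer LSTM encoder whose output layer is trained with NCE over the 10k-word PTB vocabulary, mini-batch SGD, gradient-norm clipping, and 600 noise samples. Only the hyperparameter values come from the quote above; the class and variable names (NCELanguageModel, out_w, out_b, train_step), the use of tf.nn.nce_loss and tf.keras layers, the initialisation ranges, and the base learning rate are assumptions rather than the authors' implementation, and dropout and the Eq. 9 learning-rate schedule (τ, ψ) are omitted.

# Minimal sketch (not the authors' code): a two-layer LSTM language model
# trained with NCE, using the medium (M) hyperparameters quoted above.
import tensorflow as tf

VOCAB = 10_000      # PTB vocabulary size
HIDDEN = 650        # hidden layer size: 200 (S), 650 (M), 1500 (L)
STEPS = 35          # unroll length: 20 (S), 35 (M and L)
BATCH = 20          # mini-batch size
NUM_NOISE = 600     # NCE noise samples
CLIP_NORM = 5.0     # gradient-norm clip: 5 (M), 10 (L)

class NCELanguageModel(tf.keras.Model):
    """Two-layer LSTM encoder; the output layer is trained with NCE."""
    def __init__(self):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(VOCAB, HIDDEN)
        self.lstm1 = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)
        self.lstm2 = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)
        # Output-layer parameters consumed by tf.nn.nce_loss.
        self.out_w = tf.Variable(tf.random.uniform([VOCAB, HIDDEN], -0.05, 0.05))
        self.out_b = tf.Variable(tf.zeros([VOCAB]))

    def call(self, tokens):                      # tokens: [BATCH, STEPS] word ids
        h = self.lstm2(self.lstm1(self.embed(tokens)))
        return tf.reshape(h, [-1, HIDDEN])       # [BATCH * STEPS, HIDDEN]

model = NCELanguageModel()
# Base rate 1.0 is an assumption; the paper schedules it with its Eq. 9.
optimizer = tf.keras.optimizers.SGD(learning_rate=1.0)

def train_step(tokens, targets):                 # targets: next-word ids, [BATCH, STEPS]
    with tf.GradientTape() as tape:
        hidden = model(tokens)
        # NCE: discriminate the true next word from NUM_NOISE sampled noise words,
        # avoiding a full softmax over the vocabulary during training.
        loss = tf.reduce_mean(tf.nn.nce_loss(
            weights=model.out_w, biases=model.out_b,
            labels=tf.reshape(targets, [-1, 1]),
            inputs=hidden, num_sampled=NUM_NOISE, num_classes=VOCAB))
    grads = tape.gradient(loss, model.trainable_variables)
    grads, _ = tf.clip_by_global_norm(grads, CLIP_NORM)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

At evaluation time, perplexity would still be computed with the full softmax over out_w and out_b, since NCE only replaces the expensive normalisation during training.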