Language Modeling with Gated Convolutional Networks

Authors: Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report results on two public large-scale language modeling datasets. First, the Google Billion Word dataset (Chelba et al., 2013)... Second, WikiText-103... (Merity et al., 2016). We compare the different gating schemes experimentally in Section 5.2.
Researcher Affiliation | Industry | Facebook AI Research. Correspondence to: Yann N. Dauphin <ynd@fb.com>.
Pseudocode | No | The paper presents mathematical equations and describes the architecture verbally but does not include any formal pseudocode or algorithm blocks. (A minimal sketch of the described gating mechanism is given after this table.)
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We report results on two public large-scale language modeling datasets. First, the Google Billion Word dataset (Chelba et al., 2013)... Second, WikiText-103... (Merity et al., 2016).
Dataset Splits | No | The paper states 'We found good hyper-parameter configurations by cross-validating with random search on a validation set' and mentions evaluating 'on the standard held out test portion of each dataset'. However, it does not give the size or percentage of the validation set, nor the methodology used to create the split.
Hardware Specification | Yes | We implement our models in Torch (Collobert et al., 2011) and train on Tesla M40 GPUs. The majority of our models are trained on single GPU... We trained larger models with an 8-GPU setup...
Software Dependencies | No | The paper states 'We implement our models in Torch (Collobert et al., 2011)'. While the citation points to 'Torch7', the text itself only mentions 'Torch' without an explicit version number for the implementation used.
Experiment Setup | Yes | In terms of optimization, we initialize the layers of the model with the Kaiming initialization (He et al., 2015b), with the learning rate sampled uniformly in the interval [1., 2.], the momentum set to 0.99, and clipping set to 0.1. (A sketch of this setup follows the gating example below.)
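As the Pseudocode row notes, the paper specifies its gating scheme only through equations. For reference, the following is a minimal PyTorch sketch of a gated linear unit (GLU) convolution in the spirit of the paper's h(X) = (X*W + b) ⊗ σ(X*V + c). The class name, layer sizes, and causal-padding details are illustrative assumptions, not the authors' exact implementation (which was written in Torch).

```python
import torch
import torch.nn as nn


class GatedConvBlock(nn.Module):
    """Minimal sketch of a gated linear unit (GLU) convolution:
    h(X) = (X*W + b) * sigmoid(X*V + c).
    Hypothetical class; sizes and padding are illustrative only."""

    def __init__(self, in_channels: int, out_channels: int, kernel_width: int):
        super().__init__()
        # A single convolution produces both the linear path and the gate.
        self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_width)
        # Left-pad so each position only sees current and previous tokens (causal).
        self.pad = nn.ConstantPad1d((kernel_width - 1, 0), 0.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, sequence_length)
        a, b = self.conv(self.pad(x)).chunk(2, dim=1)
        return a * torch.sigmoid(b)  # element-wise gating
```

The final two lines are equivalent to `torch.nn.functional.glu(..., dim=1)`, which performs the same split-and-gate operation.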
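The Experiment Setup quote pins down several hyper-parameters. Below is a hedged sketch of that setup; the helper names are hypothetical, and the use of plain SGD with Nesterov momentum and global-norm gradient clipping is an assumption, since the quote does not specify the exact optimizer or clipping variant.

```python
import random

import torch
import torch.nn as nn


def build_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    """Hypothetical helper reflecting the reported setup: Kaiming initialization,
    learning rate sampled uniformly from [1.0, 2.0], momentum 0.99."""
    for module in model.modules():
        if isinstance(module, (nn.Conv1d, nn.Linear)):
            nn.init.kaiming_normal_(module.weight)  # Kaiming (He et al., 2015) init
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    lr = random.uniform(1.0, 2.0)  # random-search draw from the stated interval [1., 2.]
    # SGD with Nesterov momentum is an assumption; the quote only states momentum = 0.99.
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.99, nesterov=True)


def training_step(model: nn.Module, optimizer: torch.optim.Optimizer,
                  loss: torch.Tensor, clip: float = 0.1) -> None:
    optimizer.zero_grad()
    loss.backward()
    # "Clipping set to 0.1" is interpreted here as clipping the global gradient norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
```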