Neural Language Modeling by Jointly Learning Syntax and Lexicon
Authors: Yikang Shen, Zhouhan Lin, Chin-wei Huang, Aaron Courville
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our model on three tasks: word-level language modeling, character-level language modeling, and unsupervised constituency parsing. The proposed model achieves (or is close to) the state-of-the-art on both word-level and character-level language modeling. The model's unsupervised parsing outperforms some strong baseline models, demonstrating that the structure found by our model is similar to the intrinsic structure provided by human experts. |
| Researcher Affiliation | Academia | Yikang Shen, Zhouhan Lin, Chin-Wei Huang & Aaron Courville, Department of Computer Science and Operations Research, Université de Montréal, Montréal, QC H3C 3J7, Canada {yi-kang.shen, zhouhan.lin, chin-wei.huang, aaron.courville}@umontreal.ca |
| Pseudocode | No | The paper describes the model architecture and mathematical formulations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any links to open-source code or statements about code availability. |
| Open Datasets | Yes | We evaluate a character-level variant of our proposed language model over a preprocessed version of the Penn Treebank (PTB) and Text8 datasets. Penn Treebank: we process the Penn Treebank dataset (Marcus et al., 1993) by following the procedure introduced in (Mikolov et al., 2012). Text8: the dataset contains 17M training tokens and has a vocabulary size of 44k words. The dataset is partitioned into a training set (first 99M characters) and a development set (last 1M characters) that is used to report performance. |
| Dataset Splits | Yes | The dataset is partitioned into a training set (first 99M characters) and a development set (last 1M characters) that is used to report performance. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments. |
| Software Dependencies | No | The paper mentions components such as the 'Adam' optimizer, 'Layer Normalization', and 'Batch Normalization', but does not provide version numbers for these or for any other software dependencies, such as programming languages or deep learning frameworks. |
| Experiment Setup | Yes | Optimization is performed with Adam using learning rate lr = 0.003, weight decay wdecay = 10^-6, β1 = 0.9, β2 = 0.999 and σ = 10^-8. We carry out gradient clipping with maximum norm 1.0. For character-level PTB, the Reading Network has two recurrent layers and the Predict Network has one residual block. Hidden state size is 1024 units. The input and output embedding sizes are 128, and are not shared. Look-back range L = 10, temperature parameter τ = 10, upper bound of memory span Nm = 20. We use a batch size of 64 and truncated backpropagation with 100 timesteps. The dropout values used on input/output embeddings, between recurrent layers, and on recurrent states were (0, 0.25, 0.1), respectively. |
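
The Dataset Splits row quotes the Text8 partition used by the paper: the first 99M characters for training and the last 1M characters for development. Below is a minimal sketch of that split, assuming the raw `text8` file (a single 100M-character string) has already been downloaded and unzipped locally; the file path is an assumption, not something the paper specifies.

```python
# Split the raw text8 character stream as described in the quoted setup:
# first 99M characters for training, last 1M for development.
TRAIN_CHARS = 99_000_000

with open("text8", "r") as f:      # path is a local placeholder
    data = f.read()

train_data = data[:TRAIN_CHARS]    # first 99M characters
dev_data = data[TRAIN_CHARS:]      # remaining ~1M characters

print(f"train: {len(train_data)} chars, dev: {len(dev_data)} chars")
```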
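
The Experiment Setup row lists the reported optimizer and regularization settings. Since the paper releases no code, the snippet below is only a hedged PyTorch sketch of how those hyperparameters map onto a standard Adam configuration with gradient clipping; `model` is a placeholder standing in for the authors' Reading/Predict networks, and interpreting the quoted σ = 10^-8 as Adam's `eps` term is an assumption.

```python
import torch

# Placeholder model; the paper's actual Reading/Predict networks are not released.
model = torch.nn.LSTM(input_size=128, hidden_size=1024, num_layers=2)

# Reported settings: lr = 0.003, weight decay = 1e-6, betas = (0.9, 0.999),
# and eps = 1e-8 (assumed to correspond to the quoted sigma term).
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.003,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=1e-6,
)

def training_step(loss: torch.Tensor) -> None:
    """One optimization step with the reported gradient clipping (max norm 1.0)."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

The remaining quoted settings (look-back range L = 10, temperature τ = 10, memory span bound Nm = 20, batch size 64, truncated BPTT over 100 timesteps, and the per-location dropout rates) are model- and loop-specific and are not reflected in this sketch.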