Variational Smoothing in Recurrent Neural Network Language Models
Authors: Lingpeng Kong, Gabor Melis, Wang Ling, Lei Yu, Dani Yogatama
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify our analysis on two benchmark language modeling datasets and demonstrate performance improvements over existing data noising methods. |
| Researcher Affiliation | Industry | Lingpeng Kong, Gabor Melis, Wang Ling, Lei Yu, Dani Yogatama; DeepMind; {lingpenk, melisgl, lingwang, leiyu, dyogatama}@google.com |
| Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any link or statement about making its source code publicly available. |
| Open Datasets | Yes | We evaluate our approaches on two standard language modeling datasets: Penn Treebank (Marcus et al., 1994) and Wikitext-2 (Merity et al., 2017). |
| Dataset Splits | No | The paper mentions using a 'development set' and 'test set' but does not specify the exact percentages or sample counts for training, validation, and test splits, nor does it cite predefined splits with specific details. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU/CPU models, memory, or specific computing infrastructure) used to run the experiments. |
| Software Dependencies | No | The paper mentions software components like LSTM and RMSprop but does not provide specific version numbers for these or other dependencies required for reproducibility. |
| Experiment Setup | Yes | We tune the RMSprop learning rate and ℓ2 regularization hyperparameter λ for all models on a development set by a grid search on {0.002, 0.003, 0.004} and {10⁻⁴, 10⁻³} respectively, and use perplexity on the development set to choose the best model. We also tune γ from {0.1, 0.2, 0.3, 0.4}. We use recurrent dropout (Semeniuta et al., 2016) for R and set it to 0.2, and apply (element-wise) input and output embedding dropouts for E and O and set it to 0.5 when E, O ∈ ℝ^{|V|×512} and 0.7 when E, O ∈ ℝ^{|V|×1024} based on preliminary experiments. We tie the input and output embedding matrices in all our experiments (i.e., E = O), except for the vanilla LSTM model, where we report results for both tied and untied. |
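
The experiment-setup row above amounts to a grid search over the RMSprop learning rate, the ℓ2 strength λ, and the smoothing weight γ, with fixed dropout rates and tied embeddings. The sketch below illustrates that search loop under stated assumptions: `build_model` and `evaluate_perplexity` are hypothetical placeholders (the paper does not release code or name a framework), and only the hyperparameter grids and dropout values are taken from the quoted text.

```python
# Minimal sketch of the hyperparameter grid search described in the paper's setup.
# `build_model` and `evaluate_perplexity` are hypothetical placeholders, not
# functions from the paper or any specific library.
import itertools

LEARNING_RATES = [0.002, 0.003, 0.004]   # RMSprop learning rates (from the paper)
L2_LAMBDAS = [1e-4, 1e-3]                # l2 regularization strengths
GAMMAS = [0.1, 0.2, 0.3, 0.4]            # smoothing hyperparameter gamma

EMBEDDING_SIZE = 512                     # the paper uses 512 or 1024
EMBED_DROPOUT = 0.5 if EMBEDDING_SIZE == 512 else 0.7   # input/output embedding dropout
RECURRENT_DROPOUT = 0.2                  # recurrent dropout (Semeniuta et al., 2016)


def grid_search(train_data, dev_data):
    """Return the configuration with the lowest development-set perplexity."""
    best_ppl, best_config = float("inf"), None
    for lr, lam, gamma in itertools.product(LEARNING_RATES, L2_LAMBDAS, GAMMAS):
        model = build_model(                      # hypothetical model constructor
            embedding_size=EMBEDDING_SIZE,
            recurrent_dropout=RECURRENT_DROPOUT,
            embedding_dropout=EMBED_DROPOUT,
            tie_embeddings=True,                  # E = O, except the untied vanilla LSTM baseline
            l2_lambda=lam,
            gamma=gamma,
        )
        model.train(train_data, optimizer="rmsprop", learning_rate=lr)
        ppl = evaluate_perplexity(model, dev_data)   # hypothetical evaluation helper
        if ppl < best_ppl:
            best_ppl, best_config = ppl, (lr, lam, gamma)
    return best_config, best_ppl
```

Model selection here mirrors the paper's stated procedure: every grid point is trained and the configuration with the best development-set perplexity is kept; test-set perplexity would then be reported only for that selected model.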