Variational Smoothing in Recurrent Neural Network Language Models

Authors: Lingpeng Kong, Gabor Melis, Wang Ling, Lei Yu, Dani Yogatama

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically verify our analysis on two benchmark language modeling datasets and demonstrate performance improvements over existing data noising methods."
Researcher Affiliation | Industry | Lingpeng Kong, Gabor Melis, Wang Ling, Lei Yu, Dani Yogatama (DeepMind); {lingpenk, melisgl, lingwang, leiyu, dyogatama}@google.com
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any link or statement about making its source code publicly available.
Open Datasets | Yes | "We evaluate our approaches on two standard language modeling datasets: Penn Treebank (Marcus et al., 1994) and Wikitext-2 (Merity et al., 2017)."
Dataset Splits | No | The paper mentions a 'development set' and 'test set' but does not specify the percentages or sample counts of the training, validation, and test splits, nor does it cite predefined splits with those details.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory, or computing infrastructure) used to run the experiments.
Software Dependencies | No | The paper mentions software components such as LSTM and RMSprop but does not provide version numbers for these or other dependencies required for reproducibility.
Experiment Setup | Yes | "We tune the RMSprop learning rate and ℓ2 regularization hyperparameter λ for all models on a development set by a grid search on {0.002, 0.003, 0.004} and {10^{-4}, 10^{-3}} respectively, and use perplexity on the development set to choose the best model. We also tune γ from {0.1, 0.2, 0.3, 0.4}. We use recurrent dropout (Semeniuta et al., 2016) for R and set it to 0.2, and apply (element-wise) input and output embedding dropouts for E and O and set it to 0.5 when E, O ∈ R^{|V|×512} and 0.7 when E, O ∈ R^{|V|×1024} based on preliminary experiments. We tie the input and output embedding matrices in all our experiments (i.e., E = O), except for the vanilla LSTM model, where we report results for both tied and untied."
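For concreteness, the search space quoted in the Experiment Setup row can be written out as a small grid-search sketch. Since the authors release no code, everything below beyond the quoted hyperparameter values is an assumption: the `train_and_eval` callable, function names, and configuration keys are hypothetical placeholders, and model selection follows the paper's stated criterion of lowest development-set perplexity.

```python
from itertools import product

# Hyperparameter grid quoted from the paper's experiment setup.
# The values are reproduced verbatim; the surrounding code is illustrative only.
GRID = {
    "learning_rate": [0.002, 0.003, 0.004],     # RMSprop learning rate
    "l2_lambda": [1e-4, 1e-3],                  # L2 regularization strength (lambda)
    "gamma": [0.1, 0.2, 0.3, 0.4],              # smoothing strength (gamma)
}

# Dropout settings reported in the paper.
RECURRENT_DROPOUT = 0.2                     # recurrent dropout on R
EMBEDDING_DROPOUT = {512: 0.5, 1024: 0.7}   # element-wise dropout on E and O, by embedding size


def grid_search(train_and_eval, embedding_size=512, tie_embeddings=True):
    """Pick the configuration with the lowest development-set perplexity.

    `train_and_eval` is a hypothetical callable that trains a model with the
    given configuration and returns its perplexity on the development set.
    """
    best_ppl, best_cfg = float("inf"), None
    for lr, lam, gamma in product(*GRID.values()):
        cfg = {
            "learning_rate": lr,
            "l2_lambda": lam,
            "gamma": gamma,
            "recurrent_dropout": RECURRENT_DROPOUT,
            "embedding_dropout": EMBEDDING_DROPOUT[embedding_size],
            "tie_embeddings": tie_embeddings,   # E = O except for the untied vanilla LSTM baseline
        }
        ppl = train_and_eval(cfg)
        if ppl < best_ppl:
            best_ppl, best_cfg = ppl, cfg
    return best_cfg, best_ppl
```

The sketch only enumerates the 3 x 2 x 4 = 24 reported grid points and defers training itself to the caller, which matches the level of detail the paper provides.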