Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

Authors: Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our proposed approach on standard language modeling benchmarks. MoS substantially improves over the current state-of-the-art results on benchmarks, by up to 3.6 points in terms of perplexity, reaching perplexities 47.69 on Penn Treebank and 40.68 on WikiText-2. We further apply MoS to a dialog dataset and show improved performance over Softmax and other baselines. Our contribution is two-fold. First, we identify the Softmax bottleneck by formulating language modeling as a matrix factorization problem. Second, we propose a simple and effective method that substantially improves over the current state-of-the-art results." From Section 3 (Experiments): "Following previous work (Krause et al., 2017; Merity et al., 2017; Melis et al., 2017), we evaluate the proposed MoS model on two widely used language modeling datasets, namely Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2016) based on perplexity." (A minimal sketch of the MoS output layer follows the table.)
Researcher Affiliation | Academia | "Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen, School of Computer Science, Carnegie Mellon University, {zhiliny,dzihang,rsalakhu,wcohen}@cs.cmu.edu"
Pseudocode | No | The paper describes the model mathematically and structurally but does not include any explicit pseudocode blocks or algorithm listings.
Open Source Code | Yes | "Code is available at https://github.com/zihangdai/mos."
Open Datasets | Yes | "We evaluate the proposed MoS model on two widely used language modeling datasets, namely Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2016) based on perplexity. To investigate whether the effectiveness of MoS can be extended to even larger datasets, we conduct an additional language modeling experiment on the 1B Word dataset (Chelba et al., 2013). We use the Switchboard dataset (Godfrey & Holliman, 1997) preprocessed by Zhao et al. (2017)."
Dataset Splits | Yes | "For fair comparison, we closely follow the regularization and optimization techniques introduced by Merity et al. (2017). We heuristically and manually search hyper-parameters for MoS based on the validation performance while limiting the model size (see Appendix B.1 for our hyper-parameters). The language modeling results on PTB and WT2 are presented in Table 1 and Table 2 respectively. For validation, we use two shards from the heldout set, namely [heldout-00, heldout-10]." (The last quote is from the 1B Word dataset section.)
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for running the experiments.
Software Dependencies | No | The paper mentions models like LSTMs and Seq2Seq, but it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers.
Experiment Setup | Yes | "The hyper-parameters used for MoS in the language modeling experiments are summarized below." The quoted Table 8, reconstructed:

Hyper-parameter | PTB | WT2
Learning rate | 20 | 15
Batch size | 12 | 15
Embedding size | 280 | 300
RNN hidden sizes | [960, 960, 620] | [1150, 1150, 650]
Number of mixture components | 15 | 15
Word-level V-dropout | 0.10 | 0.10
Embedding V-dropout | 0.55 | 0.40
Hidden state V-dropout | 0.20 | 0.225
Recurrent weight dropout (Wan et al., 2013) | 0.50 | 0.50
Context vector V-dropout | 0.30 | 0.30

Table 9 also specifies hyper-parameters for dynamic evaluation. (A configuration sketch based on Table 8 follows.)
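For readers unfamiliar with the model under evaluation, the mixture-of-softmaxes (MoS) output layer the paper proposes can be summarized in a few lines. The sketch below is a minimal PyTorch rendition assembled from the paper's description (K mixture components, each applying its own softmax over a shared output embedding); it is illustrative only, all names are ours, and the authors' actual implementation is in the repository linked in the table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSHead(nn.Module):
    """Mixture-of-softmaxes output layer (illustrative sketch, not the released code).

    A single softmax over logits h @ W^T produces a log-probability matrix whose
    rank is bounded by the hidden size (the "Softmax bottleneck"). MoS instead
    mixes K softmaxes, each computed from its own projection of the context
    vector, which lifts that rank ceiling.
    """
    def __init__(self, hidden_size, embed_size, vocab_size, n_components=15):
        super().__init__()
        self.K = n_components
        # Prior (mixture-weight) network: one weight per component.
        self.prior = nn.Linear(hidden_size, n_components)
        # Projection to K component-specific context vectors.
        self.latent = nn.Linear(hidden_size, n_components * embed_size)
        # Shared output embedding (typically tied to the input embedding).
        self.decoder = nn.Linear(embed_size, vocab_size)

    def forward(self, hidden):
        # hidden: (batch, hidden_size), e.g. the top RNN layer's output.
        batch = hidden.size(0)
        # Mixture weights pi_k, normalized over the K components.
        pi = F.softmax(self.prior(hidden), dim=-1)        # (batch, K)
        # Component context vectors h_k with a tanh nonlinearity.
        h = torch.tanh(self.latent(hidden))
        h = h.view(batch * self.K, -1)                    # (batch*K, embed)
        # One softmax per component over the vocabulary.
        probs = F.softmax(self.decoder(h), dim=-1)
        probs = probs.view(batch, self.K, -1)             # (batch, K, vocab)
        # Mix in probability space (not logit space), then take the log.
        mixed = torch.einsum('bk,bkv->bv', pi, probs)
        return torch.log(mixed + 1e-8)                    # log P(token | context)
```

Mixing in probability space rather than logit space is the essential design choice: a weighted average of logits would collapse back into a single low-rank softmax, whereas averaging the component distributions does not.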
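The reported results (47.69 on PTB, 40.68 on WT2) are perplexities, i.e. the exponential of the mean per-token negative log-likelihood. A trivial helper (ours, not from the paper) makes the relation explicit:

```python
import math

def perplexity(total_nll: float, n_tokens: int) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(total_nll / n_tokens)

# Sanity check: a PTB perplexity of 47.69 corresponds to a mean NLL of
# ln(47.69) ≈ 3.865 nats per token.
assert abs(perplexity(3.865, 1) - 47.69) < 0.1
```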
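As a reading aid for Table 8, its hyper-parameters translate directly into a configuration object. The sketch below is hypothetical (the dataclass and its field names are ours) and encodes only the values quoted above; the PTB column is the default, with WT2 overriding the fields that differ.

```python
from dataclasses import dataclass

@dataclass
class MoSConfig:
    # Defaults follow the PTB column of Table 8.
    learning_rate: float = 20.0
    batch_size: int = 12
    embed_size: int = 280
    rnn_hidden_sizes: tuple = (960, 960, 620)
    n_components: int = 15
    # Variational (V-) dropout rates from Table 8.
    word_dropout: float = 0.10
    embed_dropout: float = 0.55
    hidden_dropout: float = 0.20
    weight_dropout: float = 0.50   # recurrent weight dropout (Wan et al., 2013)
    context_dropout: float = 0.30

ptb = MoSConfig()
wt2 = MoSConfig(learning_rate=15.0, batch_size=15, embed_size=300,
                rnn_hidden_sizes=(1150, 1150, 650),
                embed_dropout=0.40, hidden_dropout=0.225)
```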