Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Authors: Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed approach on standard language modeling benchmarks. MoS substantially improves over the current state-of-the-art results on benchmarks, by up to 3.6 points in terms of perplexity, reaching perplexities of 47.69 on Penn Treebank and 40.68 on WikiText-2. We further apply MoS to a dialog dataset and show improved performance over Softmax and other baselines. Our contribution is two-fold. First, we identify the Softmax bottleneck by formulating language modeling as a matrix factorization problem. Second, we propose a simple and effective method that substantially improves over the current state-of-the-art results. (Section 3, Experiments:) We conduct a series of experiments with the following settings: Following previous work (Krause et al., 2017; Merity et al., 2017; Melis et al., 2017), we evaluate the proposed MoS model on two widely used language modeling datasets, namely Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2016) based on perplexity. (A minimal sketch of the MoS output layer appears after this table.) |
| Researcher Affiliation | Academia | Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, William W. Cohen; School of Computer Science, Carnegie Mellon University; {zhiliny,dzihang,rsalakhu,wcohen}@cs.cmu.edu |
| Pseudocode | No | The paper describes the model mathematically and structurally but does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Code is available at https://github.com/zihangdai/mos. |
| Open Datasets | Yes | We evaluate the proposed MoS model on two widely used language modeling datasets, namely Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2016) based on perplexity. To investigate whether the effectiveness of MoS can be extended to even larger datasets, we conduct an additional language modeling experiment on the 1B Word dataset (Chelba et al., 2013). We use the Switchboard dataset (Godfrey & Holliman, 1997) preprocessed by Zhao et al. (2017). |
| Dataset Splits | Yes | For fair comparison, we closely follow the regularization and optimization techniques introduced by Merity et al. (2017). We heuristically and manually search hyper-parameters for MoS based on the validation performance while limiting the model size (see Appendix B.1 for our hyper-parameters). The language modeling results on PTB and WT2 are presented in Table 1 and Table 2 respectively. For validation, we use two shards from the heldout set, namely [heldout-00, heldout-10]. (1B Word dataset section) |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU model, CPU type, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions models like LSTMs and Seq2Seq, but it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers. |
| Experiment Setup | Yes | The hyper-parameters used for MoS in the language modeling experiments are summarized in Table 8 (PTB / WT2): Learning rate 20 / 15; Batch size 12 / 15; Embedding size 280 / 300; RNN hidden sizes [960, 960, 620] / [1150, 1150, 650]; Number of mixture components 15 / 15; Word-level V-dropout 0.10 / 0.10; Embedding V-dropout 0.55 / 0.40; Hidden state V-dropout 0.20 / 0.225; Recurrent weight dropout (Wan et al., 2013) 0.50 / 0.50; Context vector V-dropout 0.30 / 0.30. Table 9 also specifies hyper-parameters for dynamic evaluation. (These values are restated as a config sketch after this table.) |
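Since the paper itself contains no pseudocode (see the Pseudocode row above), here is a minimal sketch of the MoS output layer as the quoted text describes it. It assumes PyTorch; the class name, tensor shapes, and the small numerical floor are our own illustrative choices, and the authors' official implementation at https://github.com/zihangdai/mos is the authoritative reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Sketch of the MoS output layer: P(x|c) = sum_k pi_k(c) * softmax(h_k(c) W^T).

    A single softmax caps the rank of the log-probability matrix at the
    embedding size (the "softmax bottleneck"); mixing K softmaxes lifts
    that cap because the log of a convex combination need not be low-rank.
    """

    def __init__(self, hidden_size, embed_size, vocab_size, n_components=15):
        super().__init__()
        self.n_components = n_components
        self.embed_size = embed_size
        self.prior = nn.Linear(hidden_size, n_components)                 # mixture weights pi_k
        self.latent = nn.Linear(hidden_size, n_components * embed_size)  # component contexts h_k
        self.decoder = nn.Linear(embed_size, vocab_size)                  # shared output embedding

    def forward(self, hidden):
        # hidden: (batch, hidden_size) -> log P(x | context): (batch, vocab_size)
        pi = F.softmax(self.prior(hidden), dim=-1)                # (batch, K)
        h = torch.tanh(self.latent(hidden))
        h = h.view(-1, self.n_components, self.embed_size)        # (batch, K, embed)
        component_probs = F.softmax(self.decoder(h), dim=-1)      # (batch, K, vocab)
        mixed = torch.einsum('bk,bkv->bv', pi, component_probs)   # convex combination
        return torch.log(mixed + 1e-8)                            # floor avoids log(0)
```

Training minimizes the negative log-likelihood of these log-probabilities; perplexity, the metric behind the reported 47.69 (PTB) and 40.68 (WT2), is the exponential of the mean per-token negative log-likelihood.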
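The Table 8 values in the Experiment Setup row can also be read as a plain configuration. The hypothetical Python dictionaries below restate them; the key names are ours, the values are the paper's.

```python
# Hypothetical config dicts restating Table 8 of the paper; key names are ours.
MOS_HPARAMS = {
    "PTB": {
        "learning_rate": 20,
        "batch_size": 12,
        "embedding_size": 280,
        "rnn_hidden_sizes": [960, 960, 620],
        "n_mixture_components": 15,
        "word_level_v_dropout": 0.10,
        "embedding_v_dropout": 0.55,
        "hidden_state_v_dropout": 0.20,
        "recurrent_weight_dropout": 0.50,  # Wan et al., 2013
        "context_vector_v_dropout": 0.30,
    },
    "WT2": {
        "learning_rate": 15,
        "batch_size": 15,
        "embedding_size": 300,
        "rnn_hidden_sizes": [1150, 1150, 650],
        "n_mixture_components": 15,
        "word_level_v_dropout": 0.10,
        "embedding_v_dropout": 0.40,
        "hidden_state_v_dropout": 0.225,
        "recurrent_weight_dropout": 0.50,
        "context_vector_v_dropout": 0.30,
    },
}
```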