On the Softmax Bottleneck of Recurrent Language Models
Authors: Dwarak Govind Parthiban, Yongyi Mao, Diana Inkpen
AAAI 2021, pp. 13640-13647
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we show via an extensive empirical study that such a correlation is fairly weak and that the high rank of the log P matrix is neither necessary nor sufficient for better test perplexity. In our experiments, we reproduced the results of the baseline AWD-LSTM model and its SS, LMS-PLIF, and MoS counterparts. The models were trained on the Penn Treebank (PTB) dataset. (A rank-bound sketch follows this table.) |
| Researcher Affiliation | Academia | Dwarak Govind Parthiban, Yongyi Mao, Diana Inkpen (University of Ottawa); yottabytt@gmail.com, ymao@uottawa.ca, diana.inkpen@uottawa.ca |
| Pseudocode | No | The paper describes functions and mathematical formulations but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The supplementary material (SM) and code can be accessed at https://github.com/yottabytt/awd-lstm-lmkit. |
| Open Datasets | Yes | Following previous works (Yang et al. 2018; Kanai et al. 2018; Ganea et al. 2019), for our language modeling experiments, we use the Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993) and the WikiText-2 (WT2) (Merity et al. 2017) datasets. |
| Dataset Splits | No | The paper reports 'Validation ppl' in Table 2, implying the use of a validation set, but it does not explicitly state the specific train/validation/test split percentages or sample counts used to create these partitions for reproducibility. |
| Hardware Specification | Yes | All model training and evaluation were conducted using NVIDIA V100 GPUs with 32GB of memory. To train a single instance of a model, we use only one GPU and not multiple GPUs. |
| Software Dependencies | No | The paper states 'Most of our implementation is based on the open source code released by the authors of AWD-LSTM and MoS,' but does not provide specific version numbers for any software dependencies like Python, PyTorch, or other libraries used. |
| Experiment Setup | Yes | To train an AWD-LSTM based model, there is a hyperparameter called the non-monotone interval n that is used to switch the optimization algorithm from SGD to Averaged SGD. The MoS model uses bsz = 12, d_h = 620, and d_e = 280, whereas the Softmax model uses bsz = 20, d_h = 400, and d_e = 400. (A sketch of this SGD-to-ASGD switch follows this table.) |
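
The Research Type quote above turns on the rank of the log-probability matrix. Below is a minimal NumPy sketch of the softmax-bottleneck rank bound the paper investigates; the shapes and variable names here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Softmax bottleneck: for a softmax LM whose final hidden size is d, the
# N-context x V-word matrix log P has rank at most d + 1, however large N and V.
rng = np.random.default_rng(0)
N, V, d = 500, 2000, 100            # contexts, vocabulary size, hidden size (illustrative)
H = rng.standard_normal((N, d))     # context vectors h_c
W = rng.standard_normal((V, d))     # output word embeddings w_x

logits = H @ W.T                    # N x V logit matrix
log_p = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)  # row-wise log-softmax

# Each row of log_p is the corresponding row of logits minus a per-row constant,
# so rank(log_p) <= rank(logits) + 1 <= d + 1.
print(np.linalg.matrix_rank(log_p), "<=", d + 1)
```

Per the quote, the paper's empirical finding is that lifting this rank (e.g., via MoS) neither guarantees nor is required for better test perplexity.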
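
The Experiment Setup row mentions the non-monotone interval n that switches training from SGD to Averaged SGD (the NT-ASGD schedule used by AWD-LSTM). Below is a hedged PyTorch sketch of that trigger; the function and variable names are hypothetical, and the authors' release may keep its bookkeeping differently.

```python
import torch

def maybe_switch_to_asgd(model, optimizer, val_ppls, n, lr):
    """Return an ASGD optimizer once validation perplexity has stopped
    improving for n consecutive epochs; otherwise keep the current optimizer.

    val_ppls: per-epoch validation perplexities, oldest first.
    n: the non-monotone interval from the quote above.
    """
    if (isinstance(optimizer, torch.optim.SGD)
            and len(val_ppls) > n
            and val_ppls[-1] > min(val_ppls[:-n])):
        # t0=0 starts iterate averaging immediately, mirroring the public
        # AWD-LSTM training script's switch to torch.optim.ASGD.
        return torch.optim.ASGD(model.parameters(), lr=lr, t0=0, lambd=0.0)
    return optimizer
```

Calling this once per epoch, after computing validation perplexity, reproduces the switching behavior the quote describes.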