On the Softmax Bottleneck of Recurrent Language Models

Authors: Dwarak Govind Parthiban, Yongyi Mao, Diana Inkpen

AAAI 2021, pp. 13640-13647

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we show via an extensive empirical study that such a correlation is fairly weak and that the high rank of the log P matrix is neither necessary nor sufficient for better test perplexity. In our experiments, we reproduced the results of the baseline AWD-LSTM model and its SS, LMS-PLIF, and MoS counterparts. The models were trained on the Penn Treebank (PTB) dataset. (See the rank sketch after the table.)
Researcher Affiliation | Academia | Dwarak Govind Parthiban, Yongyi Mao, Diana Inkpen; University of Ottawa; yottabytt@gmail.com, ymao@uottawa.ca, diana.inkpen@uottawa.ca
Pseudocode | No | The paper describes functions and mathematical formulations but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | The supplementary material (SM) and code can be accessed at https://github.com/yottabytt/awd-lstm-lmkit.
Open Datasets | Yes | Following previous works (Yang et al. 2018; Kanai et al. 2018; Ganea et al. 2019), for our language modeling experiments, we use the Penn Treebank (PTB) (Marcus, Santorini, and Marcinkiewicz 1993) and the WikiText-2 (WT2) (Merity et al. 2017) datasets.
Dataset Splits | No | The paper reports 'Validation ppl' in Table 2, implying the use of a validation set, but it does not explicitly state the train/validation/test split percentages or sample counts used to create these partitions.
Hardware Specification | Yes | All model training and evaluation were conducted using NVIDIA V100 GPUs with 32 GB of memory. To train a single instance of a model, we use only one GPU, not multiple GPUs.
Software Dependencies | No | The paper states 'Most of our implementation is based on the open source code released by the authors of AWD-LSTM and MoS,' but does not provide specific version numbers for any software dependencies such as Python, PyTorch, or other libraries.
Experiment Setup | Yes | Hyperparameter Configuration: To train an AWD-LSTM based model, there is a hyperparameter called the non-monotone interval n that is used to switch the optimization algorithm from SGD to Averaged SGD. The MoS model uses bsz = 12, d_h = 620, and d_e = 280, whereas the Softmax model uses bsz = 20, d_h = 400, and d_e = 400, where d_h and d_e denote the last hidden-layer and embedding dimensions, respectively. (See the NT-ASGD sketch after the table.)
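
The rank claim in the Research Type row can be made concrete with a small experiment. The following is a minimal PyTorch sketch, not the authors' code: the dimensions (n_contexts, d, vocab) are illustrative, with d = 400 chosen to match the Softmax configuration in the setup row. It shows that the log-probability matrix produced by a single softmax head has rank at most d + 1, no matter how large the vocabulary is.

```python
# Minimal sketch (illustrative dimensions, not the authors' code): empirical
# rank of the log-probability matrix produced by a plain softmax head.
import torch

torch.manual_seed(0)
n_contexts, d, vocab = 512, 400, 10000  # d = 400 matches the Softmax model above

H = torch.randn(n_contexts, d)  # hidden states, one row per context
W = torch.randn(vocab, d)       # output embedding matrix

# Rows of log_p are log P(. | context) = HW^T - logsumexp(HW^T) 1^T,
# i.e. a rank-<=d matrix plus a rank-1 per-row correction.
log_p = torch.log_softmax(H @ W.t(), dim=-1)

# The softmax bottleneck: rank(log_p) <= d + 1 (here, at most 401),
# even though the matrix is 512 x 10000.
print(torch.linalg.matrix_rank(log_p).item())
```

MoS raises this bound by mixing several softmaxes; the paper's point, per the excerpt above, is that the extra rank does not reliably translate into better test perplexity.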
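
The non-monotone interval n in the Experiment Setup row drives AWD-LSTM's NT-ASGD schedule: training starts with SGD and switches to Averaged SGD once the validation loss stops improving over the last n checks. Below is a minimal PyTorch sketch of that trigger; the function name, default values, and toy loop are ours, not taken from the paper's released code.

```python
# Minimal sketch (hypothetical helper, not the authors' code) of the
# NT-ASGD trigger used by AWD-LSTM: start with SGD, switch to ASGD once
# the validation loss has not improved over the last n checks.
import torch

def maybe_switch_to_asgd(optimizer, params, val_history, n=5, lr=30.0):
    """Return an ASGD optimizer once validation loss stops improving.

    Trigger: the newest validation loss is worse than the best loss
    observed more than n evaluations ago.
    """
    if (isinstance(optimizer, torch.optim.SGD)
            and len(val_history) > n
            and val_history[-1] > min(val_history[:-n])):
        return torch.optim.ASGD(params, lr=lr, t0=0, lambd=0.0)
    return optimizer

# Toy usage: a dummy parameter and a stagnating validation curve.
w = torch.nn.Parameter(torch.zeros(3))
opt = torch.optim.SGD([w], lr=30.0)
history = []
for val_loss in [5.0, 4.0, 3.9, 3.95, 3.92, 3.94, 3.96, 3.97]:
    history.append(val_loss)
    opt = maybe_switch_to_asgd(opt, [w], history, n=5)
print(type(opt).__name__)  # prints "ASGD" once the curve stagnates
```

In the released AWD-LSTM code the same check is performed inline in the training loop; the function form here is only for illustration.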