Sigsoftmax: Reanalysis of the Softmax Bottleneck

Authors: Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, Shuichi Adachi

NeurIPS 2018

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of sigsoftmax, we conducted experiments on word-level language modeling. We compared sigsoftmax with softmax, the ReLU-based function, and the sigmoid-based function.
Researcher Affiliation | Collaboration | Sekitoshi Kanai (NTT Software Innovation Center, Keio Univ., kanai.sekitoshi@lab.ntt.co.jp); Yasuhiro Fujiwara (NTT Software Innovation Center, fujiwara.yasuhiro@lab.ntt.co.jp); Yuki Yamanaka (NTT Secure Platform Laboratories, yamanaka.yuki@lab.ntt.co.jp); Shuichi Adachi (Keio Univ., adachi.shuichi@appi.keio.ac.jp)
Pseudocode | No | The paper contains mathematical formulations and definitions but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper links to code for baselines and related work (e.g., https://github.com/salesforce/awd-lstm-lm, https://github.com/benkrause/dynamic-evaluation, https://github.com/zihangdai/mos) but does not state that source code for the proposed method (sigsoftmax) is openly available, nor does it provide a link to such code.
Open Datasets | Yes | We used the Penn Treebank dataset (PTB) [19, 24] and the WikiText-2 dataset (WT2) [22], following the previous studies [23, 16, 34].
Dataset Splits | Yes | PTB is split into a training set (about 930k tokens), a validation set (about 74k tokens), and a test set (about 82k tokens); the vocabulary size M was set to 10k, and all words outside the vocabulary were replaced with a special token. WT2 is a collection of tokens from a set of Wikipedia articles; it is split into a training set (about 2,100k tokens), a validation set (about 220k tokens), and a test set (about 250k tokens), with a vocabulary size M of 33,278.
Hardware Specification | No | The paper does not describe the hardware (e.g., specific GPU or CPU models) used to run the experiments.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., Python, PyTorch, or CUDA versions) required for replication.
Experiment Setup | Yes | For fair comparison, the experimental conditions, such as unit sizes, dropout rates, initialization, and the optimization method, were the same as in the previous studies [23, 34, 16], using their code, except for the number of epochs. We set the number of epochs to twice the original value used in [23], since the losses did not converge within the original number of epochs.
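For reference, the sigsoftmax output function assessed here replaces the softmax numerator exp(z_i) with exp(z_i)·σ(z_i), normalized over the vocabulary. Below is a minimal NumPy sketch of that formula; the function name and the max-shift stabilization detail are illustrative choices, not taken from the paper's text.

```python
import numpy as np


def sigsoftmax(z):
    """Normalize exp(z_i) * sigmoid(z_i) over the vocabulary.

    The extra sigmoid factor makes log-probabilities a nonlinear
    function of the logits, unlike softmax, whose log-output is
    linear in z up to a shared normalization constant.
    """
    z = np.asarray(z, dtype=float)
    # Shift only the exp term for numerical stability; the constant
    # factor exp(-z.max()) cancels in the normalization below.
    g = np.exp(z - z.max()) * (1.0 / (1.0 + np.exp(-z)))
    return g / g.sum()


probs = sigsoftmax([1.0, 2.0, -0.5])
print(probs)  # nonnegative entries that sum to 1
```

Because exp(z)·σ(z) is strictly increasing in z, sigsoftmax preserves the ranking of the logits, just as softmax does.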