Sigsoftmax: Reanalysis of the Softmax Bottleneck
Authors: Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, Shuichi Adachi
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of sigsoftmax, we conducted experiments on word-level language modeling. We compared sigsoftmax with softmax, the ReLU-based function, and the sigmoid-based function. (A sketch of these output functions follows the table.) |
| Researcher Affiliation | Collaboration | Sekitoshi Kanai (NTT Software Innovation Center, Keio Univ.) kanai.sekitoshi@lab.ntt.co.jp; Yasuhiro Fujiwara (NTT Software Innovation Center) fujiwara.yasuhiro@lab.ntt.co.jp; Yuki Yamanaka (NTT Secure Platform Laboratories) yamanaka.yuki@lab.ntt.co.jp; Shuichi Adachi (Keio Univ.) adachi.shuichi@appi.keio.ac.jp |
| Pseudocode | No | The paper contains mathematical formulations and definitions but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper provides links to code used for baselines and related work (e.g., 'https://github.com/salesforce/awd-lstm-lm', 'https://github.com/benkrause/dynamic-evaluation', 'https://github.com/zihangdai/mos') but does not explicitly state that the source code for their proposed method (sigsoftmax) is openly available or provide a link to it. |
| Open Datasets | Yes | We used Penn Treebank dataset (PTB) [19, 24] and WikiText-2 dataset (WT2) [22] by following the previous studies [23, 16, 34]. |
| Dataset Splits | Yes | PTB is split into a training set (about 930k tokens), validation set (about 74k tokens), and test set (about 82k tokens). The vocabulary size M was set to 10k, and all words outside the vocabulary were replaced with a special token (see the vocabulary-mapping sketch after the table). WT2 is a collection of tokens from the set of articles on Wikipedia. WT2 is also split into a training set (about 2,100k tokens), validation set (about 220k tokens), and test set (about 250k tokens). The vocabulary size M was 33,278. |
| Hardware Specification | No | The paper does not explicitly describe the hardware specifications (e.g., specific GPU or CPU models) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., Python, PyTorch, or CUDA versions) required for replication. |
| Experiment Setup | Yes | For fair comparison, the experimental conditions, such as unit sizes, dropout rates, initialization, and the optimization method, were the same as in the previous studies [23, 34, 16] (using their code), except for the number of epochs. We set the number of epochs to twice the original number used in [23], since the losses did not converge within the original number of epochs. |
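As a reference for the output functions compared in the Research Type row, below is a minimal PyTorch sketch of sigsoftmax and the two baselines. The paper defines sigsoftmax as exp(z_i)·sigmoid(z_i) normalized over the vocabulary; the max-logit subtraction for numerical stability, the `eps` guard, and the function names are choices of this sketch, not taken from the paper.

```python
import torch

def sigsoftmax(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # sigsoftmax(z)_i = exp(z_i) * sigmoid(z_i) / sum_j exp(z_j) * sigmoid(z_j)
    # Subtracting the max logit inside exp() cancels between numerator and
    # denominator, so it is mathematically neutral but avoids overflow.
    shifted = logits - logits.max(dim=dim, keepdim=True).values
    weights = torch.exp(shifted) * torch.sigmoid(logits)
    return weights / weights.sum(dim=dim, keepdim=True)

def sigmoid_normalized(logits: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Sigmoid-based baseline: sigmoid(z_i) / sum_j sigmoid(z_j).
    weights = torch.sigmoid(logits)
    return weights / weights.sum(dim=dim, keepdim=True)

def relu_normalized(logits: torch.Tensor, dim: int = -1,
                    eps: float = 1e-8) -> torch.Tensor:
    # ReLU-based baseline: relu(z_i) / sum_j relu(z_j); eps guards against
    # an all-negative row of logits (an assumption of this sketch).
    weights = torch.relu(logits)
    return weights / (weights.sum(dim=dim, keepdim=True) + eps)

# Usage: probabilities over a 10k-word vocabulary, as in the PTB experiments.
logits = torch.randn(2, 10_000)
probs = sigsoftmax(logits)
assert torch.allclose(probs.sum(dim=-1), torch.ones(2), atol=1e-5)
```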
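The Dataset Splits row notes that PTB's vocabulary is capped at M = 10k, with out-of-vocabulary words mapped to a special token. Here is a minimal sketch of that mapping, assuming a `<unk>` token and most-frequent-first selection; both are standard for PTB preprocessing but are implementation details not spelled out in this report.

```python
from collections import Counter

def build_vocab(train_tokens, max_size=10_000, unk="<unk>"):
    # Reserve index 0 for the unknown token, then keep the most frequent words.
    vocab = {unk: 0}
    for token, _ in Counter(train_tokens).most_common(max_size - 1):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab, unk="<unk>"):
    # Words outside the vocabulary are replaced with the special token's index.
    return [vocab.get(token, vocab[unk]) for token in tokens]

# Usage: ids = encode(test_tokens, build_vocab(train_tokens))
```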