SVD-Softmax: Fast Softmax Approximation on Large Vocabulary Neural Networks

Authors: Kyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, Wonyong Sung

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the proposed algorithm provides both fast and accurate evaluation of the most probable top-K word probabilities.
Researcher Affiliation | Academia | Kyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, Wonyong Sung; Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea; skhu20@snu.ac.kr, {mjlee, ischoi, yhboo}@dsp.snu.ac.kr, wysung@snu.ac.kr
Pseudocode | Yes | Algorithm 1: the proposed SVD-softmax (a sketch of the procedure appears after this table).
Open Source Code | No | The paper neither states that source code for the described methodology is released nor links to a code repository.
Open Datasets | Yes | The WikiText-2 [20] and One Billion Word benchmark (OBW) [21] datasets were used for language modeling.
Dataset Splits | No | The paper mentions training and evaluation data (e.g., 'approximately 2M training tokens', 'One thousand sequential frames were used for the evaluation', 'evaluated with newstest 2013'), but it gives no percentages, sample counts, or explicit citations that would make the training/validation/test splits reproducible across all datasets.
Hardware Specification | Yes | The experiments were conducted on an NVIDIA GTX Titan-X (Pascal) GPU and an Intel i7-6850 CPU.
Software Dependencies | No | The paper mentions tools such as the OpenNMT toolkit and the Moses toolkit but specifies no version numbers for these or for any other software dependencies, such as deep learning frameworks or libraries.
Experiment Setup | Yes | The models were trained with stochastic gradient descent (SGD), an initial learning rate of 1.0, and momentum of 0.95. The batch size was set to 20, and the network was unrolled for 35 timesteps. Dropout [23] was applied to the LSTM output with a drop ratio of 0.5. Gradient clipping [24] with a maximum norm of 5 was applied. (A configuration sketch of these settings follows the table.)
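
For reference, Algorithm 1 (SVD-softmax) factorizes the output projection once with SVD, ranks the whole vocabulary with a cheap low-rank "preview" of the logits, and recomputes exact logits only for the most promising candidates. The NumPy sketch below is a minimal illustration under that reading of the paper; the function names (`build_svd_factors`, `svd_softmax_topk`) and the default values for the preview width, candidate count, and `k` are illustrative choices, not values taken from the paper.

```python
import numpy as np

def build_svd_factors(A):
    """One-time offline factorization of the output projection A (V x D).
    Returns B = U @ diag(S) of shape (V, D) and V_t (D, D), so A = B @ V_t."""
    U, S, V_t = np.linalg.svd(A, full_matrices=False)
    return U * S, V_t  # scales each column of U by its singular value

def svd_softmax_topk(B, V_t, b, h, width=16, num_full=256, k=10):
    """Approximate the top-k softmax probabilities for hidden state h (D,).

    width    -- number of leading columns used for the cheap preview
    num_full -- number of candidate words whose logits are computed exactly
    """
    h_tilde = V_t @ h                                  # rotate h into the SVD basis
    z = B[:, :width] @ h_tilde[:width] + b             # preview logits for all V words
    cand = np.argpartition(z, -num_full)[-num_full:]   # top-N candidates by preview
    z[cand] = B[cand] @ h_tilde + b[cand]              # exact logits for candidates only
    p = np.exp(z - z.max())                            # softmax over the mixed logits;
    p /= p.sum()                                       # the normalizer is estimated
    top = np.argsort(p)[-k:][::-1]                     # from preview + exact values
    return top, p[top]
```

The preview width and candidate count trade speed against accuracy: at their maximum values the exact softmax is recovered, while small values bring the per-token cost close to O(VW + ND) rather than O(VD).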
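The reported experiment setup corresponds to a standard truncated-BPTT language-model training loop. The PyTorch sketch below only shows how the stated hyperparameters (SGD with learning rate 1.0 and momentum 0.95, batch size 20, 35-step unrolling, dropout 0.5 on the LSTM output, gradient clipping at norm 5) would be wired together; the `LSTMLanguageModel` class, its layer sizes, and the `batches` iterator are assumed placeholders, not details from the paper.

```python
import torch
import torch.nn as nn

# Hyperparameters as reported in the experiment setup.
LEARNING_RATE = 1.0
MOMENTUM = 0.95
BATCH_SIZE = 20     # sequences per batch
BPTT_STEPS = 35     # network unrolled for 35 timesteps
DROPOUT = 0.5       # applied to the LSTM output
MAX_GRAD_NORM = 5.0

class LSTMLanguageModel(nn.Module):
    """Placeholder LSTM language model; layer sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.drop = nn.Dropout(DROPOUT)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.proj(self.drop(out)), state

def train_epoch(model, batches, vocab_size):
    """One epoch of truncated BPTT with the reported optimizer settings.
    `batches` yields (inputs, targets) of shape (BATCH_SIZE, BPTT_STEPS)."""
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=LEARNING_RATE, momentum=MOMENTUM)
    loss_fn = nn.CrossEntropyLoss()
    state = None
    for inputs, targets in batches:
        if state is not None:  # detach so gradients stop at the 35-step window
            state = tuple(s.detach() for s in state)
        logits, state = model(inputs, state)
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()
```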