Efficient Softmax Approximation for GPUs
Authors: Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments carried out on standard benchmarks, such as Europarl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax. We conduct an empirical complexity analysis of this model on recent GPUs. This leads us to define a realistic complexity model that is incorporated in the proposed optimization. Our approach provides a significant acceleration factor compared to the regular softmax, i.e., 2× to 10× speed-ups. Equivalently, we improve the accuracy under computational constraints. Importantly, on the largest corpus, this higher efficiency empirically comes at no cost in accuracy for a given amount of training data, in contrast to concurrent approaches improving the efficiency. This section provides a set of experiments aiming at analyzing the trade-off between actual complexity and effectiveness of several strategies, in particular the approach presented in the previous section. First we describe our evaluation protocol, then we evaluate some of the properties of our model, and finally we compare it against standard baselines on standard benchmarks. |
| Researcher Affiliation | Industry | Édouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou Facebook AI Research {egrave,ajoulin,moustaphacisse,grangier,rvj}@fb.com |
| Pseudocode | No | The paper describes the methods textually and mathematically but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code of our method is available at https://github.com/facebookresearch/adaptive-softmax. (A hedged usage sketch of an adaptive-softmax layer follows this table.) |
| Open Datasets | Yes | Text8 is a standard compression dataset containing a pre-processed version of the first 100 million characters from Wikipedia in English. It has been recently used for language modeling (Mikolov et al., 2014) and has a vocabulary of 44k words (http://mattmahoney.net/dc/textdata). Europarl is a machine translation corpus containing 20 languages (Koehn, 2005) (http://www.statmt.org/europarl/). One Billion Word is a massive corpus introduced by Chelba et al. (2013) (https://code.google.com/archive/p/1-billion-word-language-modeling-benchmark/). |
| Dataset Splits | No | The paper mentions evaluating perplexity 'on validation' data but does not provide specific details on the train/validation/test splits, such as percentages, sample counts, or citations to predefined splits for any of the datasets used. |
| Hardware Specification | No | All the experiments were run on the same GPU with the Maxwell architecture. This specifies the GPU architecture but not a specific model, processor type, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions using 'Adagrad (Duchi et al., 2011)' as an optimizer, but it does not specify any other software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | On Text8 and Europarl, the models have d = 512 hidden units and are regularized with weight decay (λ = 10⁻⁶). On the One Billion Word benchmark, we use d = 2048 hidden units and no regularization. The dimension of the input word embeddings is set to 256, so that large models fit in GPU memory. For the backpropagation through time, we unroll the models for 20 steps. We use Adagrad (Duchi et al., 2011), with a step size of 0.1 and 5 epochs, and we clip the norm of the gradients to 1. The batch size B is set to 128, except on the Finnish portion of Europarl where B = 64 due to memory constraints. (A hedged training-configuration sketch based on these values follows this table.) |
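
To accompany the open-source row above, here is a minimal sketch of how an adaptive-softmax output layer can be used. It relies on PyTorch's `nn.AdaptiveLogSoftmaxWithLoss`, an implementation in the spirit of this paper, rather than the authors' original Torch code; the cluster cutoffs and tensor shapes are illustrative assumptions, with only d = 512 and the ~44k Text8 vocabulary taken from the table above.

```python
import torch
import torch.nn as nn

# Illustrative sizes: d = 512 hidden units and a ~44k-word vocabulary are quoted
# in the table above; the batch/sequence shapes and cluster cutoffs are assumptions.
hidden_dim = 512
vocab_size = 44_000
batch, bptt = 128, 20

# Adaptive softmax: frequent words live in a small "head", rare words are pushed
# into tail clusters with progressively smaller projections (div_value).
criterion = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_dim,
    n_classes=vocab_size,
    cutoffs=[2_000, 10_000],  # head of 2k frequent words, then two tail clusters
    div_value=4.0,
)

hidden = torch.randn(batch * bptt, hidden_dim)           # flattened RNN outputs
targets = torch.randint(0, vocab_size, (batch * bptt,))  # next-word indices

out = criterion(hidden, targets)  # named tuple: (output, loss)
print(out.loss, out.loss.exp())   # mean NLL and the corresponding perplexity
```

The cutoffs determine how the vocabulary is partitioned into clusters; the paper's contribution is choosing this partition (and the tail projection sizes) from a GPU-aware complexity model rather than by hand.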
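
The experiment-setup row can likewise be read as a concrete training configuration. The sketch below wires the quoted hyperparameters (d = 512, 256-dimensional embeddings, 20-step unrolling, Adagrad with step size 0.1, gradient-norm clipping at 1, batch size 128, weight decay 10⁻⁶) into a toy LSTM language model; the model itself, the random batch, and the use of Adagrad's `weight_decay` argument for the λ regularization are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the setup row (Text8/Europarl configuration).
d, emb_dim, vocab = 512, 256, 44_000   # hidden units, input embeddings, ~Text8 vocab
bptt, batch_size = 20, 128             # BPTT unroll length, batch size
lr, epochs, grad_clip, weight_decay = 0.1, 5, 1.0, 1e-6

# Toy model: embedding -> LSTM -> adaptive-softmax head (placeholder architecture).
embedding = nn.Embedding(vocab, emb_dim)
lstm = nn.LSTM(emb_dim, d, batch_first=True)
head = nn.AdaptiveLogSoftmaxWithLoss(d, vocab, cutoffs=[2_000, 10_000])
params = [*embedding.parameters(), *lstm.parameters(), *head.parameters()]

# Adagrad with the quoted step size; applying weight decay as L2 through the
# optimizer is one plausible reading of the paper's regularization.
optimizer = torch.optim.Adagrad(params, lr=lr, weight_decay=weight_decay)

for epoch in range(epochs):
    # A single random batch stands in for a pass over the real corpus.
    tokens = torch.randint(0, vocab, (batch_size, bptt + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:].reshape(-1)

    optimizer.zero_grad()
    hidden, _ = lstm(embedding(inputs))                # (batch, bptt, d)
    loss = head(hidden.reshape(-1, d), targets).loss   # adaptive-softmax NLL
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, grad_clip)  # clip gradient norm to 1
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```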