Efficient softmax approximation for GPUs

Authors: Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments carried out on standard benchmarks, such as Europarl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax. The code of our method is available at https://github.com/facebookresearch/adaptive-softmax.
Researcher Affiliation | Industry | Edouard Grave¹, Armand Joulin¹, Moustapha Cissé¹, David Grangier¹, Hervé Jégou¹; ¹Facebook AI Research. Correspondence to: Edouard Grave <egrave@fb.com>.
Pseudocode | No | The paper describes the proposed method verbally and mathematically, but does not provide pseudocode or a clearly labeled algorithm block. (An illustrative sketch using PyTorch's built-in adaptive softmax appears after this table.)
Open Source Code | Yes | The code of our method is available at https://github.com/facebookresearch/adaptive-softmax.
Open Datasets | Yes | We evaluate our method on standard datasets, and use the perplexity (ppl) as an evaluation metric, as the function of the training time or of the number of training data (epochs). The datasets have varying vocabulary sizes, in different languages, which allows us to better understand the strengths and weaknesses of the different approaches. Text8 is a standard compression dataset containing a pre-processed version of the first 100 million characters from Wikipedia in English. It has been recently used for language modeling (Mikolov et al., 2014) and has a vocabulary of 44k words. Europarl is a machine translation corpus, containing 20 languages (Koehn, 2005). For most languages, there are 10M–60M tokens and the vocabulary is in between 44k and 250k words. One Billion Word is a massive corpus introduced by Chelba et al. (2013). It contains 0.8B tokens and a vocabulary comprising almost 800k words.
Dataset Splits | No | The paper mentions using perplexity (on validation) in Figure 5 and refers to standard datasets, but it does not specify explicit percentages or sample counts for training, validation, or test splits for reproduction, nor does it cite specific predefined splits with attribution for these datasets.
Hardware Specification | Yes | Figure 1 reports empirical timings as a function of k for typical parameters of B and d for two GPU models, namely K40 and M40. We observe that the computation time g(k) is constant for low values of k, until a certain inflection point k_0 ≈ 50, and then becomes affine for values k > k_0. Empirically, c_m = 0.40 ms on a K40 and 0.22 ms on a M40. All the experiments were run on the same GPU with the Maxwell architecture. (A rough GPU timing sketch for g(k) appears after this table.)
Software Dependencies | No | The paper mentions "cuBLAS" as an efficient implementation for matrix multiplication and "Adagrad (Duchi et al., 2011)" as the optimization scheme, but it does not provide specific version numbers for these or any other software components.
Experiment Setup | Yes | We use an LSTM with one layer in all our experiments. On Text8 and Europarl, the models have d = 512 hidden units and are regularized with weight decay (λ = 10⁻⁶). On the One Billion Word benchmark, we use d = 2048 hidden units and no regularization. The dimension of the input word embeddings is set to 256, so that large models fit in GPU memory. For the backpropagation through time, we unroll the models for 20 steps. We use Adagrad (Duchi et al., 2011), with a step size of 0.1 and 5 epochs, and we clip the norm of the gradients to 1. The batch size B is set to 128, except on the Finnish portion of Europarl where B = 64 due to memory constraints. (A hedged PyTorch sketch of this setup appears after this table.)
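
Although the paper itself gives no pseudocode (see the Pseudocode row above), PyTorch ships an implementation of this kind of adaptive softmax as torch.nn.AdaptiveLogSoftmaxWithLoss, whose documentation cites this paper. The following is a minimal sketch of how the vocabulary is split into a frequent-word head and lower-capacity tail clusters; the cutoffs, vocabulary size, and batch size below are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of adaptive softmax via PyTorch's built-in module.
# Cutoffs and sizes below are illustrative, not the paper's settings.
import torch
import torch.nn as nn

d = 512                 # hidden size (matches the Text8/Europarl setup quoted above)
vocab_size = 44_000     # Text8-scale vocabulary
batch = 128

# Partition the vocabulary into a frequent-word head and two rare-word tail clusters.
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=d,
    n_classes=vocab_size,
    cutoffs=[2000, 10000],   # hypothetical cluster boundaries
    div_value=4.0,           # each tail cluster uses a projection d / 4^i wide
)

hidden = torch.randn(batch, d)                    # stand-in for LSTM hidden states
targets = torch.randint(0, vocab_size, (batch,))  # stand-in for next-word ids

out = adaptive_softmax(hidden, targets)
print(out.output.shape, out.loss.item())          # target log-probs and mean NLL
```

The div_value argument reflects the capacity reduction the paper applies to rare-word clusters: tail clusters are projected into lower-dimensional spaces before their softmax, which is where most of the speedup over a full softmax comes from.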
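The Hardware Specification row quotes the paper's timing model g(k) with an inflection point around k_0 ≈ 50. A rough way to probe this on one's own GPU is to time a (B × d) by (d × k) matrix product for increasing k, as sketched below. The B and d values are plausible choices, not necessarily those used in the paper's Figure 1, and the measured constants will differ from the K40/M40 numbers quoted above.

```python
# Rough timing sketch for g(k): time of a (B x d) @ (d x k) product on GPU
# as a function of k, to look for the flat region and inflection point.
# B, d, and the list of k values are illustrative.
import time
import torch

assert torch.cuda.is_available()
B, d = 128, 512
x = torch.randn(B, d, device="cuda")

for k in [10, 25, 50, 100, 250, 500, 1000, 5000, 10000]:
    W = torch.randn(d, k, device="cuda")
    for _ in range(10):          # warm-up
        x @ W
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):         # timed runs
        x @ W
    torch.cuda.synchronize()
    print(f"k={k:6d}  g(k) ≈ {(time.perf_counter() - t0) / 100 * 1e3:.3f} ms")
```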
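To make the quoted Experiment Setup row concrete, here is a hedged PyTorch sketch of the Text8/Europarl configuration: a one-layer LSTM with 256-dimensional input embeddings and 512 hidden units, trained with Adagrad (step size 0.1), weight decay 10⁻⁶, gradient-norm clipping at 1, sequences unrolled for 20 steps, and batch size 128. Data loading, the 5-epoch training loop, and the adaptive-softmax output layer are omitted; vocab_size is a placeholder and the plain Linear decoder stands in for the paper's adaptive softmax head.

```python
# Hedged sketch of the quoted training setup; not the authors' released code.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 44_000, 256, 512
bptt, batch_size = 20, 128

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, batch_first=True)
decoder = nn.Linear(hidden_dim, vocab_size)   # stand-in for the adaptive softmax head
criterion = nn.CrossEntropyLoss()

params = list(embedding.parameters()) + list(lstm.parameters()) + list(decoder.parameters())
# weight decay 1e-6 passed through the optimizer as a stand-in for the paper's regularizer
optimizer = torch.optim.Adagrad(params, lr=0.1, weight_decay=1e-6)

# One illustrative training step on random token ids.
tokens = torch.randint(0, vocab_size, (batch_size, bptt + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden_states, _ = lstm(embedding(inputs))            # (B, bptt, hidden_dim)
logits = decoder(hidden_states)                       # (B, bptt, vocab_size)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # clip gradient norm to 1
optimizer.step()
```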