Efficient softmax approximation for GPUs

Authors: Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments carried out on standard benchmarks, such as Europarl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax. The code of our method is available at https://github.com/facebookresearch/adaptive-softmax.
Researcher Affiliation | Industry | Edouard Grave¹, Armand Joulin¹, Moustapha Cissé¹, David Grangier¹, Hervé Jégou¹; ¹Facebook AI Research. Correspondence to: Edouard Grave <egrave@fb.com>.
Pseudocode | No | The paper describes the proposed method verbally and mathematically, but does not provide pseudocode or a clearly labeled algorithm block. (An illustrative sketch using PyTorch's built-in adaptive softmax appears after this table.)
Open Source Code | Yes | The code of our method is available at https://github.com/facebookresearch/adaptive-softmax.
Open Datasets | Yes | We evaluate our method on standard datasets, and use the perplexity (ppl) as an evaluation metric, as the function of the training time or of the number of training data (epochs). The datasets have varying vocabulary sizes, in different languages, which allows us to better understand the strengths and weaknesses of the different approaches. Text8 is a standard compression dataset containing a pre-processed version of the first 100 million characters from Wikipedia in English. It has been recently used for language modeling (Mikolov et al., 2014) and has a vocabulary of 44k words. Europarl is a machine translation corpus, containing 20 languages (Koehn, 2005). For most languages, there are 10M–60M tokens and the vocabulary is in between 44k and 250k words. One Billion Word is a massive corpus introduced by Chelba et al. (2013). It contains 0.8B tokens and a vocabulary comprising almost 800k words.
Dataset Splits | No | The paper mentions using perplexity (on validation) in Figure 5 and refers to standard datasets, but it does not specify explicit percentages or sample counts for training, validation, or test splits for reproduction, nor does it cite specific predefined splits with attribution for these datasets.
Hardware Specification | Yes | Figure 1 reports empirical timings as a function of k for typical parameters of B and d for two GPU models, namely K40 and M40. We observe that the computation time g(k) is constant for low values of k, until a certain inflection point k_0 ≈ 50, and then becomes affine for values k > k_0. Empirically, c_m = 0.40 ms on a K40 and 0.22 ms on a M40. All the experiments were run on the same GPU with the Maxwell architecture. (A rough GPU timing sketch for g(k) appears after this table.)
Software Dependencies | No | The paper mentions "cuBLAS" as an efficient implementation for matrix multiplication and "Adagrad (Duchi et al., 2011)" as the optimization scheme, but it does not provide specific version numbers for these or any other software components.
Experiment Setup | Yes | We use an LSTM with one layer in all our experiments. On Text8 and Europarl, the models have d = 512 hidden units and are regularized with weight decay (λ = 10⁻⁶). On the One Billion Word benchmark, we use d = 2048 hidden units and no regularization. The dimension of the input word embeddings is set to 256, so that large models fit in GPU memory. For the backpropagation through time, we unroll the models for 20 steps. We use Adagrad (Duchi et al., 2011), with a step size of 0.1 and 5 epochs, and we clip the norm of the gradients to 1. The batch size B is set to 128, except on the Finnish portion of Europarl where B = 64 due to memory constraints. (A hedged PyTorch sketch of this setup appears after this table.)
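
Although the paper itself gives no pseudocode (see the Pseudocode row above), PyTorch ships an implementation of this kind of adaptive softmax as torch.nn.AdaptiveLogSoftmaxWithLoss, whose documentation cites this paper. The following is a minimal sketch of how the vocabulary is split into a frequent-word head and lower-capacity tail clusters; the cutoffs, vocabulary size, and batch size below are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of adaptive softmax via PyTorch's built-in module.
# Cutoffs and sizes below are illustrative, not the paper's settings.
import torch
import torch.nn as nn

d = 512                 # hidden size (matches the Text8/Europarl setup quoted above)
vocab_size = 44_000     # Text8-scale vocabulary
batch = 128

# Partition the vocabulary into a frequent-word head and two rare-word tail clusters.
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=d,
    n_classes=vocab_size,
    cutoffs=[2000, 10000],   # hypothetical cluster boundaries
    div_value=4.0,           # each tail cluster uses a projection d / 4^i wide
)

hidden = torch.randn(batch, d)                    # stand-in for LSTM hidden states
targets = torch.randint(0, vocab_size, (batch,))  # stand-in for next-word ids

out = adaptive_softmax(hidden, targets)
print(out.output.shape, out.loss.item())          # target log-probs and mean NLL
```

The div_value argument reflects the capacity reduction the paper applies to rare-word clusters: tail clusters are projected into lower-dimensional spaces before their softmax, which is where most of the speedup over a full softmax comes from.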
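The Hardware Specification row quotes the paper's timing model g(k) with an inflection point around k_0 ≈ 50. A rough way to probe this on one's own GPU is to time a (B × d) by (d × k) matrix product for increasing k, as sketched below. The B and d values are plausible choices, not necessarily those used in the paper's Figure 1, and the measured constants will differ from the K40/M40 numbers quoted above.

```python
# Rough timing sketch for g(k): time of a (B x d) @ (d x k) product on GPU
# as a function of k, to look for the flat region and inflection point.
# B, d, and the list of k values are illustrative.
import time
import torch

assert torch.cuda.is_available()
B, d = 128, 512
x = torch.randn(B, d, device="cuda")

for k in [10, 25, 50, 100, 250, 500, 1000, 5000, 10000]:
    W = torch.randn(d, k, device="cuda")
    for _ in range(10):          # warm-up
        x @ W
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(100):         # timed runs
        x @ W
    torch.cuda.synchronize()
    print(f"k={k:6d}  g(k) ≈ {(time.perf_counter() - t0) / 100 * 1e3:.3f} ms")
```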
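To make the quoted Experiment Setup row concrete, here is a hedged PyTorch sketch of the Text8/Europarl configuration: a one-layer LSTM with 256-dimensional input embeddings and 512 hidden units, trained with Adagrad (step size 0.1), weight decay 10⁻⁶, gradient-norm clipping at 1, sequences unrolled for 20 steps, and batch size 128. Data loading, the 5-epoch training loop, and the adaptive-softmax output layer are omitted; vocab_size is a placeholder and the plain Linear decoder stands in for the paper's adaptive softmax head.

```python
# Hedged sketch of the quoted training setup; not the authors' released code.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 44_000, 256, 512
bptt, batch_size = 20, 128

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1, batch_first=True)
decoder = nn.Linear(hidden_dim, vocab_size)   # stand-in for the adaptive softmax head
criterion = nn.CrossEntropyLoss()

params = list(embedding.parameters()) + list(lstm.parameters()) + list(decoder.parameters())
# weight decay 1e-6 passed through the optimizer as a stand-in for the paper's regularizer
optimizer = torch.optim.Adagrad(params, lr=0.1, weight_decay=1e-6)

# One illustrative training step on random token ids.
tokens = torch.randint(0, vocab_size, (batch_size, bptt + 1))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

hidden_states, _ = lstm(embedding(inputs))            # (B, bptt, hidden_dim)
logits = decoder(hidden_states)                       # (B, bptt, vocab_size)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)  # clip gradient norm to 1
optimizer.step()
```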