Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Authors: Noam Shazeer, Azalia Mirhoseini*, Krzysztof Maziarz*, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean (* equal contribution)
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost. |
| Researcher Affiliation | Collaboration | ¹Google Brain, {noam,azalia,andydavis,qvl,geoffhinton,jeff}@google.com; ²Jagiellonian University, Cracow, krzysztof.maziarz@student.uj.edu.pl |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using TensorFlow but does not provide a link or explicit statement about the availability of their own source code for the described methodology. |
| Open Datasets | Yes | This dataset, introduced by (Chelba et al., 2013), consists of shuffled unique sentences from news articles, totaling approximately 829 million words, with a vocabulary of 793,471 words. We benchmarked our method on the WMT'14 En→Fr and En→De corpora, whose training sets have 36M sentence pairs and 5M sentence pairs, respectively. |
| Dataset Splits | Yes | The combination of newstest2012 and newstest2013 was used as the development set. |
| Hardware Specification | Yes | We trained our models using TensorFlow (Abadi et al., 2016) on clusters containing 16-32 Tesla K40 GPUs. For each of our models, we determine computational efficiency in TFLOPS/GPU by dividing the number of floating point operations required to process one training batch by the observed step time and the number of GPUs in the cluster. |
| Software Dependencies | No | The paper mentions using TensorFlow but does not specify its version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | We used the Adam optimizer (Kingma & Ba, 2015). The base learning rate was increased linearly for the first 1000 training steps, and decreased after that so as to be proportional to the inverse square root of the step number. The Softmax output layer was trained efficiently using importance sampling similarly to the models in (Jozefowicz et al., 2016). For each model, we performed a hyper-parameter search to find the best dropout probability, in increments of 0.1. To ensure balanced expert utilization we set w_importance = 0.1 and w_load = 0.1, as described in Section 4 and Appendix A. |
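
The Research Type row above summarizes the paper's core mechanism: a sparsely-gated MoE layer that routes each input to a small number of experts via noisy top-k gating (Section 2.1 of the paper). The sketch below is a minimal NumPy illustration of that gating and combination step, not the authors' TensorFlow implementation; expert definitions, dimensions, and variable names are assumptions chosen for readability.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def noisy_top_k_gating(x, w_gate, w_noise, k, rng):
    """Noisy top-k gating: add input-dependent Gaussian noise to the gating
    logits, keep the k largest, and softmax over only those."""
    clean_logits = x @ w_gate                     # (n_experts,)
    noise_stddev = softplus(x @ w_noise)          # (n_experts,)
    noisy_logits = clean_logits + rng.standard_normal(clean_logits.shape) * noise_stddev

    # Set all but the top-k logits to -inf so they receive exactly zero gate weight.
    masked = np.full_like(noisy_logits, -np.inf)
    top_k = np.argsort(noisy_logits)[-k:]
    masked[top_k] = noisy_logits[top_k]
    return softmax(masked)

def moe_layer(x, experts, w_gate, w_noise, k=4, seed=0):
    """Weighted combination of the selected experts; experts with zero gate
    weight are never evaluated, which is where the compute savings come from."""
    gates = noisy_top_k_gating(x, w_gate, w_noise, k, np.random.default_rng(seed))
    contributions = [gates[i] * experts[i](x) for i in np.flatnonzero(gates)]
    return np.sum(contributions, axis=0)

# Toy usage: 8 experts, each a simple linear map, top-4 routing.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda v, W=rng.standard_normal((d, d)): v @ W for _ in range(n_experts)]
x = rng.standard_normal(d)
y = moe_layer(x, experts, rng.standard_normal((d, n_experts)),
              rng.standard_normal((d, n_experts)), k=4)
```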
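
The Hardware Specification row defines computational efficiency as floating point operations per training batch divided by the observed step time and the number of GPUs in the cluster. The helper below makes that arithmetic concrete; the numbers in the usage line are hypothetical, not figures reported in the paper.

```python
def tflops_per_gpu(flops_per_batch, step_time_s, num_gpus):
    """TFLOPS/GPU = floating point ops per training batch
                    / (observed step time in seconds * number of GPUs)."""
    return flops_per_batch / (step_time_s * num_gpus) / 1e12

# Hypothetical example values, for illustration only.
print(tflops_per_gpu(flops_per_batch=8.0e14, step_time_s=30.0, num_gpus=32))  # ≈ 0.83
```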
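
The Experiment Setup row describes a learning-rate schedule (linear warmup for the first 1000 steps, then decay proportional to the inverse square root of the step number) and balancing losses weighted by w_importance and w_load. The sketch below shows one plausible reading of that schedule together with the importance loss from Appendix A; the scaling that joins the two pieces of the schedule is an assumption, since the paper states only proportionality.

```python
import numpy as np

def learning_rate(step, base_lr, warmup_steps=1000):
    """Linear warmup for the first `warmup_steps`, then 1/sqrt(step) decay.
    Matching the two pieces at step == warmup_steps is an assumption here."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps / step) ** 0.5

def importance_loss(gates, w_importance=0.1):
    """Load-balancing term (Appendix A): w_importance times the squared
    coefficient of variation of per-expert importance, where importance is
    the sum of gate values over the batch."""
    importance = gates.sum(axis=0)            # gates: (batch_size, n_experts)
    cv = importance.std() / importance.mean()
    return w_importance * cv ** 2
```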