Mixtape: Breaking the Softmax Bottleneck Efficiently
Authors: Zhilin Yang, Thang Luong, Russ R. Salakhutdinov, Quoc V. Le
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On four benchmarks including language modeling and machine translation, the Mixtape layer substantially improves the efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University, 2Google Brain |
| Pseudocode | No | The paper describes the steps of the Mixtape layer in a numbered list (Section 3.4), but it is descriptive text rather than formal pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | For language modeling, we exactly follow the settings in [19] on Penn Treebank [12] and One Billion Word [4] for fair comparison. For machine translation, our experiments are based on two widely-used WMT 14 benchmarks, English to German (En-De) and English to French (En-Fr), following the setups in [13, 18]. |
| Dataset Splits | No | The paper refers to training and test sets (e.g., the 'WMT 16 training data' and 'newstest14') but does not give explicit percentages or counts for training, validation, and test splits, nor does it state how a validation set was used or constructed for hyperparameter tuning beyond the general setup references. |
| Hardware Specification | No | The paper mentions 'GPU memory budget' but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions 'sacrebleu' and 'TensorFlow' (via Mesh TensorFlow), but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | On En-De, we employ a 6-layer Transformer with embedding size 1024, inner layer size 4096, and 16 attention heads. We train for 300K steps with a learning rate of 2.5, a batch size of 4096, and 16K warmup steps. We apply a dropout of 0.3 on the layer outputs, a dropout of 0.15 on attention probabilities, a dropout of 0.2 on tanh(U_k g_c) in Eq. (4), and a Gaussian noise with 0.1 stdev on pre-activation gate priors. |
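
For readers attempting to reproduce the En-De setup, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object. The sketch below is a minimal illustration only; the class and field names (e.g. `EnDeTransformerConfig`, `dropout_gate_hidden`) are our own and do not come from the paper or any released code, since none is available.

```python
from dataclasses import dataclass


@dataclass
class EnDeTransformerConfig:
    """En-De hyperparameters as quoted in the paper's experiment setup.

    Field names are illustrative; only the numeric values are taken
    from the paper.
    """
    num_layers: int = 6               # 6-layer Transformer
    embedding_size: int = 1024        # embedding size 1024
    inner_layer_size: int = 4096      # feed-forward inner layer size 4096
    num_attention_heads: int = 16     # 16 attention heads
    train_steps: int = 300_000        # 300K training steps
    learning_rate: float = 2.5        # learning rate 2.5
    batch_size: int = 4096            # batch size 4096
    warmup_steps: int = 16_000        # 16K warmup steps
    dropout_layer_output: float = 0.3     # dropout on layer outputs
    dropout_attention: float = 0.15       # dropout on attention probabilities
    dropout_gate_hidden: float = 0.2      # dropout on tanh(U_k g_c) in Eq. (4)
    gate_prior_noise_stddev: float = 0.1  # Gaussian noise on pre-activation gate priors


if __name__ == "__main__":
    # Instantiate and print the configuration for inspection.
    print(EnDeTransformerConfig())
```

Such a dataclass is only a bookkeeping aid; how these values are wired into a training loop (optimizer schedule, noise injection on the gate priors, etc.) is not specified beyond the prose quoted above.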