Mixtape: Breaking the Softmax Bottleneck Efficiently

Authors: Zhilin Yang, Thang Luong, Russ R. Salakhutdinov, Quoc V. Le

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On four benchmarks including language modeling and machine translation, the Mixtape layer substantially improves the efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance.
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Google Brain
Pseudocode | No | The paper describes the steps of the Mixtape layer in a numbered list (Section 3.4), but it is descriptive text rather than formal pseudocode or an algorithm block.
Open Source Code | No | The paper does not contain any explicit statements or links indicating the release of open-source code for the described methodology.
Open Datasets | Yes | For language modeling, we exactly follow the settings in [19] on Penn Treebank [12] and One Billion Word [4] for fair comparison. For machine translation, our experiments are based on two widely-used WMT 14 benchmarks, English to German (En-De) and English to French (En-Fr), following the setups in [13, 18].
Dataset Splits | No | The paper mentions training and test sets (e.g., 'WMT 16 training data' and 'newstest14') but does not provide counts or percentages for training, validation, and test splits, nor does it specify how a validation set was created or used for hyperparameter tuning beyond general references to prior setups.
Hardware Specification | No | The paper mentions 'GPU memory budget' but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions 'sacrebleu' and 'TensorFlow' (via Mesh TensorFlow), but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | On En-De, we employ a 6-layer Transformer with embedding size 1024, inner layer size 4096, and 16 attention heads. We train for 300K steps with a learning rate of 2.5, a batch size of 4096, and 16K warmup steps. We apply a dropout of 0.3 on the layer outputs, a dropout of 0.15 on attention probabilities, a dropout of 0.2 on tanh(U_k g_c) in Eq. (4), and a Gaussian noise with 0.1 stdev on pre-activation gate priors.
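
For readers who want a concrete starting point, the sketch below collects the En-De hyperparameters quoted in the last row of the table into a plain Python config, together with an inverse-square-root warmup schedule. The dictionary keys, the function name, and the interpretation of "learning rate of 2.5" as the base multiplier of the standard Transformer schedule are assumptions made here for illustration; they are not taken from the paper or any released code.

```python
# Hypothetical sketch: hyperparameters as quoted in the reproducibility table above.
# Key names are illustrative and do not come from the paper's code.
MIXTAPE_EN_DE_CONFIG = {
    "num_layers": 6,                 # 6-layer Transformer
    "embedding_size": 1024,
    "inner_layer_size": 4096,        # feed-forward (inner) layer size
    "num_attention_heads": 16,
    "train_steps": 300_000,
    "learning_rate": 2.5,            # quoted value; treated below as a schedule multiplier (assumption)
    "batch_size": 4096,
    "warmup_steps": 16_000,
    "dropout_layer_output": 0.3,
    "dropout_attention": 0.15,
    "dropout_gate_tanh": 0.2,        # dropout on tanh(U_k g_c) in Eq. (4)
    "gate_prior_noise_stdev": 0.1,   # Gaussian noise on pre-activation gate priors
}


def learning_rate(step: int, cfg: dict = MIXTAPE_EN_DE_CONFIG) -> float:
    """Inverse-square-root schedule with linear warmup.

    Assumption: the quoted learning rate of 2.5 is the base multiplier of the
    standard Transformer schedule, with 16K warmup steps.
    """
    step = max(step, 1)
    d_model = cfg["embedding_size"]
    warmup = cfg["warmup_steps"]
    scale = cfg["learning_rate"]
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


if __name__ == "__main__":
    # Inspect the schedule at a few points (during warmup, at its end, and late in training).
    for s in (1, 8_000, 16_000, 100_000, 300_000):
        print(f"step {s:>7}: lr = {learning_rate(s):.6f}")
```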