Mixtape: Breaking the Softmax Bottleneck Efficiently

Authors: Zhilin Yang, Thang Luong, Russ R. Salakhutdinov, Quoc V. Le

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On four benchmarks including language modeling and machine translation, the Mixtape layer substantially improves the efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance.
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Google Brain
Pseudocode | No | The paper describes the steps of the Mixtape layer in a numbered list (Section 3.4), but it is descriptive text rather than formal pseudocode or an algorithm block.
Open Source Code | No | The paper does not contain any explicit statements or links indicating the release of open-source code for the described methodology.
Open Datasets | Yes | For language modeling, we exactly follow the settings in [19] on Penn Treebank [12] and One Billion Word [4] for fair comparison. For machine translation, our experiments are based on two widely-used WMT 14 benchmarks, English to German (En-De) and English to French (En-Fr), following the setups in [13, 18].
Dataset Splits | No | The paper mentions training and test sets (e.g., 'WMT 16 training data' and 'newstest14') but does not provide counts or percentages for training, validation, and test splits, nor does it specify how a validation set was created or used for hyperparameter tuning beyond general references to prior setups.
Hardware Specification | No | The paper mentions 'GPU memory budget' but does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running the experiments.
Software Dependencies | No | The paper mentions 'sacrebleu' and 'TensorFlow' (via Mesh TensorFlow), but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | On En-De, we employ a 6-layer Transformer with embedding size 1024, inner layer size 4096, and 16 attention heads. We train for 300K steps with a learning rate of 2.5, a batch size of 4096, and 16K warmup steps. We apply a dropout of 0.3 on the layer outputs, a dropout of 0.15 on attention probabilities, a dropout of 0.2 on tanh(U_k g_c) in Eq. (4), and a Gaussian noise with 0.1 stdev on pre-activation gate priors.
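
For readers who want a concrete starting point, the sketch below collects the En-De hyperparameters quoted in the last row of the table into a plain Python config, together with an inverse-square-root warmup schedule. The dictionary keys, the function name, and the interpretation of "learning rate of 2.5" as the base multiplier of the standard Transformer schedule are assumptions made here for illustration; they are not taken from the paper or any released code.

```python
# Hypothetical sketch: hyperparameters as quoted in the reproducibility table above.
# Key names are illustrative and do not come from the paper's code.
MIXTAPE_EN_DE_CONFIG = {
    "num_layers": 6,                 # 6-layer Transformer
    "embedding_size": 1024,
    "inner_layer_size": 4096,        # feed-forward (inner) layer size
    "num_attention_heads": 16,
    "train_steps": 300_000,
    "learning_rate": 2.5,            # quoted value; treated below as a schedule multiplier (assumption)
    "batch_size": 4096,
    "warmup_steps": 16_000,
    "dropout_layer_output": 0.3,
    "dropout_attention": 0.15,
    "dropout_gate_tanh": 0.2,        # dropout on tanh(U_k g_c) in Eq. (4)
    "gate_prior_noise_stdev": 0.1,   # Gaussian noise on pre-activation gate priors
}


def learning_rate(step: int, cfg: dict = MIXTAPE_EN_DE_CONFIG) -> float:
    """Inverse-square-root schedule with linear warmup.

    Assumption: the quoted learning rate of 2.5 is the base multiplier of the
    standard Transformer schedule, with 16K warmup steps.
    """
    step = max(step, 1)
    d_model = cfg["embedding_size"]
    warmup = cfg["warmup_steps"]
    scale = cfg["learning_rate"]
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


if __name__ == "__main__":
    # Inspect the schedule at a few points (during warmup, at its end, and late in training).
    for s in (1, 8_000, 16_000, 100_000, 300_000):
        print(f"step {s:>7}: lr = {learning_rate(s):.6f}")
```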