Pay Attention to MLPs
Authors: Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute. |
| Researcher Affiliation | Industry | Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le; Google Research, Brain Team; {hanxiaol,zihangd,davidso,qvl}@google.com |
| Pseudocode | Yes | Pseudo-code for the gMLP block: def gmlp_block(x, d_model, d_ffn): shortcut = x x = norm(x, axis="channel") x = proj(x, d_ffn, axis="channel") x = gelu(x) x = spatial_gating_unit(x) x = proj(x, d_model, axis="channel") return x + shortcut def spatial_gating_unit(x): u, v = split(x, axis="channel") v = norm(v, axis="channel") n = get_dim(v, axis="spatial") v = proj(v, n, axis="spatial", init_bias=1) return u * v (A runnable sketch of this pseudocode appears below the table.) |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Here we examine gMLP in the vision domain by applying it to the image classification task on ImageNet [31] without using extra data. |
| Dataset Splits | Yes | Here we examine gMLP in the vision domain by applying it to the image classification task on ImageNet [31] without using extra data. For finetuning, we report the dev-set performance for SST-2 and MNLI in GLUE [41], and each result entry was obtained by taking the median of five independent runs. In addition, we report finetuning results on SQuAD [44, 45] to test the model's ability in reasoning over a longer context. |
| Hardware Specification | No | The paper does not mention any specific hardware details such as GPU or CPU models, memory specifications, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x, or specific library versions). |
| Experiment Setup | Yes | With a standard batch-size-256, 1M-step training setup as in the original BERT... For ablations and case studies, all models are trained with batch size 2048, max length 128 for 125K steps... For main results, models are trained with batch size 256, max length 512 for 1M steps... We adjust only the strength of stochastic depth [32] as we move from smaller to larger models in Table 1. All the other hyperparameters remain shared across our three models. (An illustrative config summary follows the table.) |
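
The Pseudocode row above quotes the paper's figure-style pseudocode for the gMLP block. Since the paper does not release code (see the Open Source Code row), the following PyTorch sketch is our own reading of that pseudocode: the framework choice and class names are assumptions, while the layer order, the channel/spatial projection axes, the near-zero weight and unit-bias initialization of the spatial projection, and the element-wise gating come directly from the paper.

```python
# Minimal PyTorch sketch of the gMLP block (hypothetical reimplementation,
# not the authors' code). Shapes follow the paper: x is (batch, seq_len, d_model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialGatingUnit(nn.Module):
    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)             # norm(v, axis="channel")
        self.spatial_proj = nn.Linear(seq_len, seq_len)  # proj(v, n, axis="spatial")
        nn.init.zeros_(self.spatial_proj.weight)         # near-zero weights (paper)
        nn.init.ones_(self.spatial_proj.bias)            # init_bias=1 (paper)

    def forward(self, x):                                # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)                        # split(x, axis="channel")
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)  # mix over tokens
        return u * v                                     # element-wise gating


class GMLPBlock(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)         # proj(x, d_ffn, axis="channel")
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)   # SGU halves the channels

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        shortcut = x
        x = self.norm(x)
        x = F.gelu(self.proj_in(x))
        x = self.sgu(x)
        x = self.proj_out(x)
        return x + shortcut                              # residual connection


if __name__ == "__main__":
    block = GMLPBlock(d_model=256, d_ffn=1536, seq_len=128)
    out = block(torch.randn(2, 128, 256))
    print(out.shape)  # torch.Size([2, 128, 256])
```

The spatial projection is what mixes information across token positions in place of self-attention; the zero-weight, unit-bias initialization makes each block behave like a plain FFN early in training, as described in the paper.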
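The Experiment Setup row quotes two BERT-style pretraining configurations. The dictionaries below merely restate those numbers in one place for quick reference; the key names and any surrounding training-loop code are hypothetical, not from the paper.

```python
# Illustrative restatement of the MLM pretraining setups quoted above
# (key names are our own; values come from the paper).
ABLATION_SETUP = {
    "batch_size": 2048,
    "max_seq_length": 128,
    "train_steps": 125_000,     # ablations and case studies
}

MAIN_SETUP = {
    "batch_size": 256,
    "max_seq_length": 512,
    "train_steps": 1_000_000,   # standard BERT-style 1M-step setup
}

# Per the paper, only the stochastic-depth strength changes across the three
# model sizes in Table 1; all other hyperparameters are shared.
```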