Pay Attention to MLPs
Authors: Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our comparisons show that self-attention is not critical for Vision Transformers, as gMLP can achieve the same accuracy. For BERT, our model achieves parity with Transformers on pretraining perplexity and is better on some downstream NLP tasks. On finetuning tasks where gMLP performs worse, making the gMLP model substantially larger can close the gap with Transformers. In general, our experiments show that gMLP can scale as well as Transformers over increased data and compute. |
| Researcher Affiliation | Industry | Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le; Google Research, Brain Team; {hanxiaol,zihangd,davidso,qvl}@google.com |
| Pseudocode | Yes | Pseudo-code for the gMLP block: def gmlp_block(x, d_model, d_ffn): shortcut = x x = norm(x, axis="channel") x = proj(x, d_ffn, axis="channel") x = gelu(x) x = spatial_gating_unit(x) x = proj(x, d_model, axis="channel") return x + shortcut def spatial_gating_unit(x): u, v = split(x, axis="channel") v = norm(v, axis="channel") n = get_dim(v, axis="spatial") v = proj(v, n, axis="spatial", init_bias=1) return u * v (A runnable sketch of this pseudocode appears below the table.) |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Here we examine gMLP in the vision domain by applying it to the image classification task on ImageNet [31] without using extra data. |
| Dataset Splits | Yes | Here we examine gMLP in the vision domain by applying it to the image classification task on ImageNet [31] without using extra data. For finetuning, we report the dev-set performance for SST-2 and MNLI in GLUE [41], and each result entry was obtained by taking the median of five independent runs. In addition, we report finetuning results on SQuAD [44, 45] to test the model's ability in reasoning over a longer context. |
| Hardware Specification | No | The paper does not mention any specific hardware details such as GPU or CPU models, memory specifications, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, TensorFlow 2.x, or specific library versions). |
| Experiment Setup | Yes | With a standard batch-size-256, 1M-step training setup as in the original BERT... For ablations and case studies, all models are trained with batch size 2048, max length 128 for 125K steps... For main results, models are trained with batch size 256, max length 512 for 1M steps... We adjust only the strength of stochastic depth [32] as we move from smaller to larger models in Table 1. All the other hyperparameters remain shared across our three models. (An illustrative config summary follows the table.) |
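
The Pseudocode row above quotes the paper's figure-style pseudocode for the gMLP block. Since the paper does not release code (see the Open Source Code row), the following PyTorch sketch is our own reading of that pseudocode: the framework choice and class names are assumptions, while the layer order, the channel/spatial projection axes, the near-zero weight and unit-bias initialization of the spatial projection, and the element-wise gating come directly from the paper.

```python
# Minimal PyTorch sketch of the gMLP block (hypothetical reimplementation,
# not the authors' code). Shapes follow the paper: x is (batch, seq_len, d_model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialGatingUnit(nn.Module):
    def __init__(self, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)             # norm(v, axis="channel")
        self.spatial_proj = nn.Linear(seq_len, seq_len)  # proj(v, n, axis="spatial")
        nn.init.zeros_(self.spatial_proj.weight)         # near-zero weights (paper)
        nn.init.ones_(self.spatial_proj.bias)            # init_bias=1 (paper)

    def forward(self, x):                                # x: (batch, seq_len, d_ffn)
        u, v = x.chunk(2, dim=-1)                        # split(x, axis="channel")
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)  # mix over tokens
        return u * v                                     # element-wise gating


class GMLPBlock(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, seq_len: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)         # proj(x, d_ffn, axis="channel")
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)   # SGU halves the channels

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        shortcut = x
        x = self.norm(x)
        x = F.gelu(self.proj_in(x))
        x = self.sgu(x)
        x = self.proj_out(x)
        return x + shortcut                              # residual connection


if __name__ == "__main__":
    block = GMLPBlock(d_model=256, d_ffn=1536, seq_len=128)
    out = block(torch.randn(2, 128, 256))
    print(out.shape)  # torch.Size([2, 128, 256])
```

The spatial projection is what mixes information across token positions in place of self-attention; the zero-weight, unit-bias initialization makes each block behave like a plain FFN early in training, as described in the paper.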
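The Experiment Setup row quotes two BERT-style pretraining configurations. The dictionaries below merely restate those numbers in one place for quick reference; the key names and any surrounding training-loop code are hypothetical, not from the paper.

```python
# Illustrative restatement of the MLM pretraining setups quoted above
# (key names are our own; values come from the paper).
ABLATION_SETUP = {
    "batch_size": 2048,
    "max_seq_length": 128,
    "train_steps": 125_000,     # ablations and case studies
}

MAIN_SETUP = {
    "batch_size": 256,
    "max_seq_length": 512,
    "train_steps": 1_000_000,   # standard BERT-style 1M-step setup
}

# Per the paper, only the stochastic-depth strength changes across the three
# model sizes in Table 1; all other hyperparameters are shared.
```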