Synthesizer: Rethinking Self-Attention for Transformer Models
Authors: Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose SYNTHESIZER, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. |
| Researcher Affiliation | Industry | 1Google Research, Mountain View, California. |
| Pseudocode | No | The paper describes the proposed methods using mathematical formulas and text, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementation of our Synthesizer model is released at https://github.com/tensorflow/mesh. |
| Open Datasets | Yes | Specifically, we conduct experiments on (1) machine translation (En→De, En→Fr), (2) autoregressive language modeling (LM1B), (3) text generation (summarization and dialogue modeling), and (4) multi-task natural language processing (GLUE/SuperGLUE). ... C4 dataset (Raffel et al., 2019) ... AGnews (Zhang et al., 2015) and movie reviews (Maas et al., 2011). |
| Dataset Splits | No | The paper does not explicitly state the specific training, validation, or test split percentages or sample counts for the datasets used. |
| Hardware Specification | Yes | Experiments are conducted on Mesh Tensorflow (Shazeer et al., 2018) and ran on 2x2 TPU V3 Chips for approximately 524K steps. |
| Software Dependencies | No | The paper mentions 'Mesh TensorFlow' and implies the use of TensorFlow, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper mentions that 'Details of each experiments can be found in the appendix', but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or other detailed experimental setup information in the main text. |
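
For context on the method the table describes, below is a minimal sketch of the two core Synthesizer variants mentioned in the abstract: Dense synthetic attention (attention logits predicted from each token alone) and Random synthetic attention (a learnable or fixed matrix shared across examples), neither of which uses query-key dot products. It is written in plain PyTorch purely as a reading aid; the class and parameter names (`DenseSynthesizer`, `RandomSynthesizer`, `max_len`, etc.) are illustrative placeholders and do not come from the authors' Mesh TensorFlow release linked above.

```python
# Minimal, single-head sketch of Dense and Random Synthesizer attention,
# assuming a fixed maximum sequence length. Illustrative only; the official
# implementation is the authors' Mesh TensorFlow release.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizer(nn.Module):
    """Predicts an l x l attention matrix from each token independently (no query-key interactions)."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Each token's row of attention logits: F(X_i) = W2 * ReLU(W1 * X_i + b1) + b2.
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, max_len)
        self.value = nn.Linear(d_model, d_model)  # standard value projection G(X)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len must not exceed max_len.
        seq_len = x.size(1)
        logits = self.w2(F.relu(self.w1(x)))[:, :, :seq_len]  # (batch, seq_len, seq_len)
        attn = torch.softmax(logits, dim=-1)
        return attn @ self.value(x)


class RandomSynthesizer(nn.Module):
    """Uses a random attention matrix shared across all examples; optionally trainable."""

    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        self.r = nn.Parameter(torch.randn(max_len, max_len), requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        attn = torch.softmax(self.r[:seq_len, :seq_len], dim=-1)  # (seq_len, seq_len), broadcast over batch
        return attn @ self.value(x)


# Toy usage showing input/output shapes.
if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                         # (batch, seq_len, d_model)
    print(DenseSynthesizer(64, max_len=32)(x).shape)   # torch.Size([2, 16, 64])
    print(RandomSynthesizer(64, max_len=32)(x).shape)  # torch.Size([2, 16, 64])
```

Both variants replace the softmax(QKᵀ) term of vanilla self-attention with synthetic logits, which is the paper's central point: competitive performance is attainable without computing token-token alignment scores.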