Synthesizer: Rethinking Self-Attention for Transformer Models

Authors: Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose SYNTHESIZER, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. (A minimal sketch of this synthetic-attention idea appears after the table.)
Researcher Affiliation | Industry | Google Research, Mountain View, California.
Pseudocode | No | The paper describes the proposed methods using mathematical formulas and text, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Implementation of our Synthesizer model is released at https://github.com/tensorflow/mesh.
Open Datasets | Yes | Specifically, we conduct experiments on (1) machine translation (En→De, En→Fr), (2) autoregressive language modeling (LM1B), (3) text generation (summarization and dialogue modeling), and (4) multi-task natural language processing (GLUE/SuperGLUE). ... C4 dataset (Raffel et al., 2019) ... AGnews (Zhang et al., 2015) and movie reviews (Maas et al., 2011).
Dataset Splits | No | The paper does not explicitly state the specific training, validation, or test split percentages or sample counts for the datasets used.
Hardware Specification | Yes | Experiments are conducted on Mesh Tensorflow (Shazeer et al., 2018) and ran on 2x2 TPU V3 Chips for approximately 524K steps.
Software Dependencies | No | The paper mentions 'Mesh TensorFlow' and implies the use of TensorFlow, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | No | The paper mentions that 'Details of each experiments can be found in the appendix', but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or other detailed experimental setup information in the main text.
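The Research Type row above describes attention weights that are synthesized without query-key (token-token) dot products. The snippet below is a minimal NumPy sketch of that idea for a single head, not the authors' released Mesh TensorFlow implementation; the function and parameter names (dense_synthesizer, random_synthesizer, W1, W2, Wv, R) and the two-layer MLP shapes are illustrative assumptions based on the abstract's description of Dense and Random Synthesizer variants.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_synthesizer(X, W1, b1, W2, b2, Wv):
    """Dense Synthesizer head (illustrative sketch, not the paper's code).

    Attention logits are synthesized from each token on its own via a small
    MLP, so no query-key (token-token) dot products are computed.
    Assumed shapes:
      X:  (seq_len, d_model)   input token representations
      W1: (d_model, d_hidden), b1: (d_hidden,)
      W2: (d_hidden, seq_len), b2: (seq_len,)
      Wv: (d_model, d_model)   value projection
    """
    B = np.maximum(X @ W1 + b1, 0.0) @ W2 + b2   # (seq_len, seq_len) synthesized logits
    A = softmax(B, axis=-1)                      # attention weights, no token-token interaction
    return A @ (X @ Wv)

def random_synthesizer(X, R, Wv):
    """Random Synthesizer head (illustrative sketch).

    R is a (seq_len, seq_len) matrix, randomly initialized and either kept
    fixed or treated as a trainable parameter; it does not depend on X at all.
    """
    A = softmax(R, axis=-1)
    return A @ (X @ Wv)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d_model, d_hidden = 8, 16, 32
    X = rng.standard_normal((seq_len, d_model))
    out_dense = dense_synthesizer(
        X,
        rng.standard_normal((d_model, d_hidden)), np.zeros(d_hidden),
        rng.standard_normal((d_hidden, seq_len)), np.zeros(seq_len),
        rng.standard_normal((d_model, d_model)),
    )
    out_random = random_synthesizer(
        X, rng.standard_normal((seq_len, seq_len)),
        rng.standard_normal((d_model, d_model)),
    )
    print(out_dense.shape, out_random.shape)  # (8, 16) (8, 16)
```

In the random variant the attention pattern is entirely input-independent, which is what makes the abstract's finding that random alignment matrices perform competitively notable.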