Synthesizer: Rethinking Self-Attention for Transformer Models
Authors: Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is useful but not that important after all. To this end, we propose SYNTHESIZER, a model that learns synthetic attention weights without token-token interactions. In our experiments, we first show that simple Synthesizers achieve highly competitive performance when compared against vanilla Transformer models across a range of tasks, including machine translation, language modeling, text generation and GLUE/SuperGLUE benchmarks. |
| Researcher Affiliation | Industry | 1Google Research, Mountain View, California. |
| Pseudocode | No | The paper describes the proposed methods using mathematical formulas and text, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Implementation of our Synthesizer model is released at https://github.com/tensorflow/mesh. |
| Open Datasets | Yes | Specifically, we conduct experiments on (1) machine translation (En→De, En→Fr), (2) autoregressive language modeling (LM1B), (3) text generation (summarization and dialogue modeling), and (4) multi-task natural language processing (GLUE/SuperGLUE). ... C4 dataset (Raffel et al., 2019) ... AGnews (Zhang et al., 2015) and movie reviews (Maas et al., 2011). |
| Dataset Splits | No | The paper does not explicitly state the specific training, validation, or test split percentages or sample counts for the datasets used. |
| Hardware Specification | Yes | Experiments are conducted on Mesh Tensorflow (Shazeer et al., 2018) and ran on 2x2 TPU V3 Chips for approximately 524K steps. |
| Software Dependencies | No | The paper mentions 'Mesh TensorFlow' and implies the use of TensorFlow, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper mentions that 'Details of each experiments can be found in the appendix', but it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or other detailed experimental setup information in the main text. |
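
For context on the method the table describes, below is a minimal sketch of the two core Synthesizer variants mentioned in the abstract: Dense synthetic attention (attention logits predicted from each token alone) and Random synthetic attention (a learnable or fixed matrix shared across examples), neither of which uses query-key dot products. It is written in plain PyTorch purely as a reading aid; the class and parameter names (`DenseSynthesizer`, `RandomSynthesizer`, `max_len`, etc.) are illustrative placeholders and do not come from the authors' Mesh TensorFlow release linked above.

```python
# Minimal, single-head sketch of Dense and Random Synthesizer attention,
# assuming a fixed maximum sequence length. Illustrative only; the official
# implementation is the authors' Mesh TensorFlow release.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizer(nn.Module):
    """Predicts an l x l attention matrix from each token independently (no query-key interactions)."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Each token's row of attention logits: F(X_i) = W2 * ReLU(W1 * X_i + b1) + b2.
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, max_len)
        self.value = nn.Linear(d_model, d_model)  # standard value projection G(X)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len must not exceed max_len.
        seq_len = x.size(1)
        logits = self.w2(F.relu(self.w1(x)))[:, :, :seq_len]  # (batch, seq_len, seq_len)
        attn = torch.softmax(logits, dim=-1)
        return attn @ self.value(x)


class RandomSynthesizer(nn.Module):
    """Uses a random attention matrix shared across all examples; optionally trainable."""

    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        self.r = nn.Parameter(torch.randn(max_len, max_len), requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        attn = torch.softmax(self.r[:seq_len, :seq_len], dim=-1)  # (seq_len, seq_len), broadcast over batch
        return attn @ self.value(x)


# Toy usage showing input/output shapes.
if __name__ == "__main__":
    x = torch.randn(2, 16, 64)                         # (batch, seq_len, d_model)
    print(DenseSynthesizer(64, max_len=32)(x).shape)   # torch.Size([2, 16, 64])
    print(RandomSynthesizer(64, max_len=32)(x).shape)  # torch.Size([2, 16, 64])
```

Both variants replace the softmax(QKᵀ) term of vanilla self-attention with synthetic logits, which is the paper's central point: competitive performance is attainable without computing token-token alignment scores.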