On the Role of Attention Masks and LayerNorm in Transformers

Authors: Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we validate our theoretical findings via numerical experiments. Following [12], we randomly select 3000 samples of 128-token excerpts (based on the BERT tokenizer) from Wikipedia using the Wikipedia API in Python.
Researcher Affiliation | Academia | Xinyi Wu (MIT LIDS), Amir Ajorlou (MIT LIDS), Yifei Wang (MIT CSAIL), Stefanie Jegelka (TU Munich and MIT CSAIL), Ali Jadbabaie (MIT LIDS); {xinyiwu,ajorlou,yifei_w,stefje,jadbabai}@mit.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Since all the experiments are for the verification purpose of our theory, they are straightforward and can be easily replicated given the instructions detailed in the paper.
Open Datasets | Yes | Following [12], we randomly select 3000 samples of 128-token excerpts (based on the BERT tokenizer) from Wikipedia using the Wikipedia API in Python. (A sampling sketch follows the table.)
Dataset Splits | No | The paper mentions using '3000 samples' but does not explicitly provide information on train, validation, or test dataset splits.
Hardware Specification | No | Compute: We ran all of our experiments on CPUs.
Software Dependencies | No | All models were implemented with PyTorch [29] and the Transformers library [36].
Experiment Setup | Yes | We use BERT [11] as the backbone transformer model and consider five different model variants... in a 128-layer randomly initialized BERT with 12 heads and hidden dimension 768. (A model-setup sketch follows the table.)
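
The data-collection step quoted above can be reproduced roughly as follows. This is a minimal sketch under stated assumptions: the `wikipedia` PyPI package stands in for "the Wikipedia API in Python", and each excerpt is taken as the first 128 `bert-base-uncased` tokens of a random article; the paper does not pin down either choice.

```python
# Sketch (not the authors' code): sample 3000 excerpts of 128 BERT tokens
# from random Wikipedia articles. Assumes the `wikipedia` PyPI package and
# first-128-tokens excerpts, neither of which is specified in the paper.
import wikipedia
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def sample_excerpts(num_samples=3000, excerpt_len=128):
    """Collect num_samples sequences of exactly excerpt_len BERT token IDs."""
    excerpts = []
    while len(excerpts) < num_samples:
        try:
            title = wikipedia.random(pages=1)                 # one random article title
            text = wikipedia.page(title, auto_suggest=False).content
        except wikipedia.exceptions.WikipediaException:
            continue                                          # skip disambiguation / missing pages
        ids = tokenizer.encode(text, truncation=True, max_length=excerpt_len)
        if len(ids) == excerpt_len:                           # keep only full-length excerpts
            excerpts.append(ids)
    return excerpts

samples = sample_excerpts(num_samples=3000, excerpt_len=128)
```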
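
Similarly, the quoted experiment setup (a 128-layer randomly initialized BERT with 12 heads and hidden dimension 768) can be instantiated with the Transformers library along the following lines. This is a hedged sketch: the five model variants studied in the paper (differing in attention masks and LayerNorm) are not reproduced here, and the feed-forward width is assumed to be the BERT-base default.

```python
# Sketch (not the authors' code): a randomly initialized deep BERT matching
# the quoted configuration, run on a single 128-token excerpt.
import torch
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=128,   # 128 transformer layers
    num_attention_heads=12,  # 12 attention heads per layer
    hidden_size=768,         # hidden dimension 768
    intermediate_size=3072,  # feed-forward width (assumed BERT-base default)
)
model = BertModel(config)    # randomly initialized; no pretrained weights loaded
model.eval()

# Pass one 128-token excerpt (placeholder random IDs here) through the model
# and collect per-layer hidden states to inspect how token representations
# evolve with depth.
input_ids = torch.randint(0, config.vocab_size, (1, 128))
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
hidden_states = outputs.hidden_states  # tuple: embedding output + one tensor per layer
```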