On the Role of Attention Masks and LayerNorm in Transformers
Authors: Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we validate our theoretical findings via numerical experiments. Following [12], we randomly select 3000 samples of 128-token excerpts (based on the BERT tokenizer) from Wikipedia using the Wikipedia API in Python. |
| Researcher Affiliation | Academia | Xinyi Wu (MIT LIDS), Amir Ajorlou (MIT LIDS), Yifei Wang (MIT CSAIL), Stefanie Jegelka (TU Munich, MIT CSAIL), Ali Jadbabaie (MIT LIDS); {xinyiwu,ajorlou,yifei_w,stefje,jadbabai}@mit.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Since all the experiments are for the verification purpose of our theory, they are straightforward and can be easily replicated given the instructions detailed in the paper. |
| Open Datasets | Yes | Following [12], we randomly select 3000 samples of 128-token excerpts (based on the BERT tokenizer) from Wikipedia using the Wikipedia API in Python. (A hedged sampling sketch follows the table.) |
| Dataset Splits | No | The paper mentions using '3000 samples' but does not explicitly provide information on train, validation, or test dataset splits. |
| Hardware Specification | No | Compute: We ran all of our experiments on CPUs. |
| Software Dependencies | No | All models were implemented with PyTorch [29] and Transformers library [36]. |
| Experiment Setup | Yes | We use BERT [11] as the backbone transformer model and consider five different model variants... in a 128 layer randomly initialized BERT with 12 heads and 768 hidden dimension. (A hedged configuration sketch follows the table.) |
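
The sampling step quoted in the Open Datasets row can be approximated as follows. This is a minimal sketch, assuming the `wikipedia` Python package as the "Wikipedia API in Python" and the `bert-base-uncased` tokenizer; the excerpting and filtering logic is an illustrative guess, not the authors' script.

```python
# Hedged sketch of the data collection quoted above: 3000 random Wikipedia
# excerpts, each truncated to 128 BERT tokens. The `wikipedia` package and the
# filtering logic are assumptions; the paper only mentions "the Wikipedia API
# in Python".
import wikipedia
from transformers import BertTokenizer

NUM_SAMPLES = 3000  # number of excerpts reported in the paper
SEQ_LEN = 128       # tokens per excerpt (BERT tokenizer)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

samples = []
while len(samples) < NUM_SAMPLES:
    try:
        title = wikipedia.random(pages=1)      # one random article title
        text = wikipedia.page(title).content   # full article text
    except Exception:                          # e.g. disambiguation or missing pages
        continue
    ids = tokenizer(text, truncation=True, max_length=SEQ_LEN,
                    return_tensors="pt")["input_ids"]
    if ids.shape[1] == SEQ_LEN:                # keep only full-length excerpts
        samples.append(ids)
```

Articles shorter than 128 tokens are skipped here; whether the authors did the same is not stated in the excerpts quoted above.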
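
The Experiment Setup row mentions a 128-layer randomly initialized BERT with 12 heads and a 768 hidden dimension. The sketch below shows one way to instantiate such a model with the Transformers library and track how similar token representations become across layers; the average pairwise cosine similarity is an illustrative probe, not necessarily the paper's exact token-uniformity measure, and the random input is a placeholder for the Wikipedia excerpts.

```python
# Hedged sketch: a randomly initialized 128-layer BERT (12 heads, 768 hidden
# dimension), matching the configuration quoted in the Experiment Setup row.
# The similarity probe below is an illustrative choice, not necessarily the
# metric used in the paper.
import torch
from transformers import BertConfig, BertModel

config = BertConfig(num_hidden_layers=128, num_attention_heads=12,
                    hidden_size=768, output_hidden_states=True)
model = BertModel(config)  # random init (~0.9B params); lower num_hidden_layers for a quick test
model.eval()

def mean_pairwise_cosine(h):
    """Average pairwise cosine similarity between token representations."""
    h = torch.nn.functional.normalize(h, dim=-1)  # (batch, seq, hidden)
    sim = h @ h.transpose(1, 2)                   # (batch, seq, seq)
    n = sim.shape[-1]
    off_diag = sim.sum(dim=(1, 2)) - n            # drop the self-similarity diagonal
    return (off_diag / (n * (n - 1))).mean().item()

with torch.no_grad():
    input_ids = torch.randint(0, config.vocab_size, (1, 128))  # placeholder input
    hidden_states = model(input_ids).hidden_states             # embeddings + 128 layers
    per_layer = [mean_pairwise_cosine(h) for h in hidden_states]
```

Values of `per_layer` approaching 1 in deeper layers would indicate that token representations are collapsing toward one another, which is the kind of behavior the paper examines under different attention masks and LayerNorm settings.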