On the Role of Attention Masks and LayerNorm in Transformers

Authors: Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we validate our theoretical findings via numerical experiments. Following [12], we randomly select 3000 samples of 128-token excerpts (based on the BERT tokenizer) from Wikipedia using the Wikipedia API in Python.
Researcher Affiliation | Academia | Xinyi Wu (MIT LIDS), Amir Ajorlou (MIT LIDS), Yifei Wang (MIT CSAIL), Stefanie Jegelka (TU Munich and MIT CSAIL), Ali Jadbabaie (MIT LIDS); {xinyiwu,ajorlou,yifei_w,stefje,jadbabai}@mit.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Since all the experiments are for the verification purpose of our theory, they are straightforward and can be easily replicated given the instructions detailed in the paper.
Open Datasets | Yes | Following [12], we randomly select 3000 samples of 128-token excerpts (based on the BERT tokenizer) from Wikipedia using the Wikipedia API in Python. (A sampling sketch follows the table.)
Dataset Splits | No | The paper mentions using '3000 samples' but does not explicitly provide information on train, validation, or test dataset splits.
Hardware Specification | No | Compute: We ran all of our experiments on CPUs.
Software Dependencies | No | All models were implemented with PyTorch [29] and the Transformers library [36].
Experiment Setup | Yes | We use BERT [11] as the backbone transformer model and consider five different model variants... in a 128-layer randomly initialized BERT with 12 heads and hidden dimension 768. (A model-setup sketch follows the table.)
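
The data-collection step quoted above can be reproduced roughly as follows. This is a minimal sketch under stated assumptions: the `wikipedia` PyPI package stands in for "the Wikipedia API in Python", and each excerpt is taken as the first 128 `bert-base-uncased` tokens of a random article; the paper does not pin down either choice.

```python
# Sketch (not the authors' code): sample 3000 excerpts of 128 BERT tokens
# from random Wikipedia articles. Assumes the `wikipedia` PyPI package and
# first-128-tokens excerpts, neither of which is specified in the paper.
import wikipedia
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def sample_excerpts(num_samples=3000, excerpt_len=128):
    """Collect num_samples sequences of exactly excerpt_len BERT token IDs."""
    excerpts = []
    while len(excerpts) < num_samples:
        try:
            title = wikipedia.random(pages=1)                 # one random article title
            text = wikipedia.page(title, auto_suggest=False).content
        except wikipedia.exceptions.WikipediaException:
            continue                                          # skip disambiguation / missing pages
        ids = tokenizer.encode(text, truncation=True, max_length=excerpt_len)
        if len(ids) == excerpt_len:                           # keep only full-length excerpts
            excerpts.append(ids)
    return excerpts

samples = sample_excerpts(num_samples=3000, excerpt_len=128)
```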
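
Similarly, the quoted experiment setup (a 128-layer randomly initialized BERT with 12 heads and hidden dimension 768) can be instantiated with the Transformers library along the following lines. This is a hedged sketch: the five model variants studied in the paper (differing in attention masks and LayerNorm) are not reproduced here, and the feed-forward width is assumed to be the BERT-base default.

```python
# Sketch (not the authors' code): a randomly initialized deep BERT matching
# the quoted configuration, run on a single 128-token excerpt.
import torch
from transformers import BertConfig, BertModel

config = BertConfig(
    num_hidden_layers=128,   # 128 transformer layers
    num_attention_heads=12,  # 12 attention heads per layer
    hidden_size=768,         # hidden dimension 768
    intermediate_size=3072,  # feed-forward width (assumed BERT-base default)
)
model = BertModel(config)    # randomly initialized; no pretrained weights loaded
model.eval()

# Pass one 128-token excerpt (placeholder random IDs here) through the model
# and collect per-layer hidden states to inspect how token representations
# evolve with depth.
input_ids = torch.randint(0, config.vocab_size, (1, 128))
with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
hidden_states = outputs.hidden_states  # tuple: embedding output + one tensor per layer
```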