SparseBERT: Rethinking the Importance Analysis in Self-attention

Authors: Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James Tin-Yau Kwok

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify our interesting findings and illustrate the effect of the proposed algorithm. In this experiment, we empirically study the effect of different positions in the self-attention module using the BERT-base.
Researcher Affiliation | Collaboration | (1) Hong Kong University of Science and Technology, Hong Kong; (2) The University of Hong Kong, Hong Kong; (3) Huawei Noah's Ark Lab; (4) Sun Yat-sen University, China.
Pseudocode | Yes | Algorithm 1 Differentiable Attention Mask (DAM). 1: initialize model parameters w and attention mask parameters α. 2: repeat 3: generate mask M_{i,j} = gumbel-sigmoid(α_{i,j}); 4: obtain the loss L with the attention mask; 5: update parameters w and α simultaneously; 6: until convergence. 7: return attention mask M. (A minimal code sketch of this procedure is given after the table.)
Open Source Code | Yes | The code is available at https://github.com/han-shi/SparseBERT.
Open Datasets | Yes | The BooksCorpus (with 800M words) (Zhu et al., 2015) and English Wikipedia (with 2,500M words) (Devlin et al., 2019) data sets are used. The GLUE benchmark (Wang et al., 2018a), SWAG (Zellers et al., 2018), and SQuAD (Rajpurkar et al., 2016; 2018) data sets are also used.
Dataset Splits | Yes | We choose the best hyper-parameter combination on the development set and test it on the evaluation server. For the MNLI sub-task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) sections.
Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9' or 'Python 3.8').
Experiment Setup | Yes | This model is stacked with 12 Transformer blocks (Section 2.1) with the following hyper-parameters: number of tokens n = 128, number of self-attention heads h = 12, and hidden layer size d = 768. As for the feed-forward layer, we set the filter size d_ff to 3072 as in (Devlin et al., 2019). The pre-training is performed for 40 epochs. The trade-off hyperparameter λ in (7) is varied in {10^-1, 10^-2, 10^-3, 10^-4}.
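
For illustration, below is a minimal PyTorch sketch of Algorithm 1 (DAM) as quoted in the Pseudocode row. The `model(batch, attention_mask=...)` interface, the mean-based sparsity penalty, and the final thresholding step are assumptions made for readability; the exact objective is Eq. (7) of the paper, and the reference implementation is in the linked repository.

```python
import torch


def gumbel_sigmoid(alpha, tau=1.0, eps=1e-10):
    """Differentiable binary relaxation of sigmoid(alpha) via Gumbel noise."""
    u1, u2 = torch.rand_like(alpha), torch.rand_like(alpha)
    g1 = -torch.log(-torch.log(u1 + eps) + eps)   # Gumbel(0, 1) samples
    g2 = -torch.log(-torch.log(u2 + eps) + eps)
    return torch.sigmoid((alpha + g1 - g2) / tau)


def train_dam(model, train_loader, seq_len=128, lam=1e-2, tau=1.0, lr=1e-4, epochs=1):
    """Sketch of Algorithm 1 (DAM): jointly learn model weights w and mask logits alpha."""
    # Step 1: initialize the attention-mask logits (one per token pair).
    alpha = torch.zeros(seq_len, seq_len, requires_grad=True)
    opt = torch.optim.Adam(list(model.parameters()) + [alpha], lr=lr)
    for _ in range(epochs):                             # step 2/6: repeat until convergence
        for batch in train_loader:
            mask = gumbel_sigmoid(alpha, tau)           # step 3: relaxed binary mask M
            loss = model(batch, attention_mask=mask)    # step 4: loss under mask (assumed interface)
            loss = loss + lam * mask.mean()             # assumed sparsity term; see Eq. (7) in the paper
            opt.zero_grad()
            loss.backward()
            opt.step()                                  # step 5: update w and alpha simultaneously
    return (alpha > 0).float()                          # step 7: return a hard attention mask M
```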
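The Experiment Setup row lists standard BERT-base dimensions; as a rough cross-check, they correspond to a configuration like the following, sketched with Hugging Face's `BertConfig` purely for illustration (the paper does not state which framework was used, and the mapping of n = 128 to `max_position_embeddings` is an assumption).

```python
from transformers import BertConfig

# BERT-base hyper-parameters as quoted in the Experiment Setup row.
config = BertConfig(
    num_hidden_layers=12,         # 12 Transformer blocks
    num_attention_heads=12,       # h = 12
    hidden_size=768,              # d = 768
    intermediate_size=3072,       # feed-forward filter size d_ff = 3072
    max_position_embeddings=128,  # number of tokens n = 128 (assumed mapping)
)
```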