SparseBERT: Rethinking the Importance Analysis in Self-attention

Authors: Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James Tin-Yau Kwok

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments verify our interesting findings and illustrate the effect of the proposed algorithm. In this experiment, we empirically study the effect of different positions in the self-attention module using the BERT-base.
Researcher Affiliation | Collaboration | (1) Hong Kong University of Science and Technology, Hong Kong; (2) The University of Hong Kong, Hong Kong; (3) Huawei Noah's Ark Lab; (4) Sun Yat-sen University, China.
Pseudocode | Yes | Algorithm 1 Differentiable Attention Mask (DAM). 1: initialize model parameters w and attention mask parameters α. 2: repeat 3: generate mask M_{i,j} = gumbel-sigmoid(α_{i,j}); 4: obtain the loss L with the attention mask; 5: update parameters w and α simultaneously; 6: until convergence. 7: return attention mask M. (A minimal code sketch of this procedure is given after the table.)
Open Source Code | Yes | The code is available at https://github.com/han-shi/SparseBERT.
Open Datasets | Yes | The BooksCorpus (with 800M words) (Zhu et al., 2015) and English Wikipedia (with 2,500M words) (Devlin et al., 2019) data sets are used. The GLUE benchmark (Wang et al., 2018a), SWAG (Zellers et al., 2018), and SQuAD (Rajpurkar et al., 2016; 2018) data sets are also used.
Dataset Splits | Yes | We choose the best hyper-parameter combination on the development set and test it on the evaluation server. For the MNLI sub-task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) sections.
Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 GPUs.
Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9' or 'Python 3.8').
Experiment Setup | Yes | This model is stacked with 12 Transformer blocks (Section 2.1) with the following hyper-parameters: number of tokens n = 128, number of self-attention heads h = 12, and hidden layer size d = 768. As for the feed-forward layer, we set the filter size d_ff to 3072 as in (Devlin et al., 2019). The pre-training is performed for 40 epochs. The trade-off hyperparameter λ in (7) is varied in {10^-1, 10^-2, 10^-3, 10^-4}.
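
For illustration, below is a minimal PyTorch sketch of Algorithm 1 (DAM) as quoted in the Pseudocode row. The `model(batch, attention_mask=...)` interface, the mean-based sparsity penalty, and the final thresholding step are assumptions made for readability; the exact objective is Eq. (7) of the paper, and the reference implementation is in the linked repository.

```python
import torch


def gumbel_sigmoid(alpha, tau=1.0, eps=1e-10):
    """Differentiable binary relaxation of sigmoid(alpha) via Gumbel noise."""
    u1, u2 = torch.rand_like(alpha), torch.rand_like(alpha)
    g1 = -torch.log(-torch.log(u1 + eps) + eps)   # Gumbel(0, 1) samples
    g2 = -torch.log(-torch.log(u2 + eps) + eps)
    return torch.sigmoid((alpha + g1 - g2) / tau)


def train_dam(model, train_loader, seq_len=128, lam=1e-2, tau=1.0, lr=1e-4, epochs=1):
    """Sketch of Algorithm 1 (DAM): jointly learn model weights w and mask logits alpha."""
    # Step 1: initialize the attention-mask logits (one per token pair).
    alpha = torch.zeros(seq_len, seq_len, requires_grad=True)
    opt = torch.optim.Adam(list(model.parameters()) + [alpha], lr=lr)
    for _ in range(epochs):                             # step 2/6: repeat until convergence
        for batch in train_loader:
            mask = gumbel_sigmoid(alpha, tau)           # step 3: relaxed binary mask M
            loss = model(batch, attention_mask=mask)    # step 4: loss under mask (assumed interface)
            loss = loss + lam * mask.mean()             # assumed sparsity term; see Eq. (7) in the paper
            opt.zero_grad()
            loss.backward()
            opt.step()                                  # step 5: update w and alpha simultaneously
    return (alpha > 0).float()                          # step 7: return a hard attention mask M
```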
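The Experiment Setup row lists standard BERT-base dimensions; as a rough cross-check, they correspond to a configuration like the following, sketched with Hugging Face's `BertConfig` purely for illustration (the paper does not state which framework was used, and the mapping of n = 128 to `max_position_embeddings` is an assumption).

```python
from transformers import BertConfig

# BERT-base hyper-parameters as quoted in the Experiment Setup row.
config = BertConfig(
    num_hidden_layers=12,         # 12 Transformer blocks
    num_attention_heads=12,       # h = 12
    hidden_size=768,              # d = 768
    intermediate_size=3072,       # feed-forward filter size d_ff = 3072
    max_position_embeddings=128,  # number of tokens n = 128 (assumed mapping)
)
```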