SparseBERT: Rethinking the Importance Analysis in Self-attention
Authors: Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James Tin-Yau Kwok
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments verify our interesting findings and illustrate the effect of the proposed algorithm. In this experiment, we empirically study the effect of different positions in the self-attention module using the BERT-base. |
| Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, Hong Kong; 2 The University of Hong Kong, Hong Kong; 3 Huawei Noah's Ark Lab; 4 Sun Yat-sen University, China. |
| Pseudocode | Yes | Algorithm 1 Differentiable Attention Mask (DAM). 1: initialize model parameter w and attention mask parameter α. 2: repeat 3: generate mask M_{i,j} ← gumbel-sigmoid(α_{i,j}); 4: obtain the loss with attention mask L; 5: update parameters w and α simultaneously; 6: until convergence. 7: return attention mask M. (A minimal sketch of this procedure is given below the table.) |
| Open Source Code | Yes | The code is available at https://github.com/han-shi/SparseBERT. |
| Open Datasets | Yes | Data sets BooksCorpus (with 800M words) (Zhu et al., 2015) and English Wikipedia (with 2,500M words) (Devlin et al., 2019) are used. The GLUE benchmark (Wang et al., 2018a), SWAG (Zellers et al., 2018) and SQuAD (Rajpurkar et al., 2016; 2018) data sets. |
| Dataset Splits | Yes | We choose the best hyper-parameter combination on the development set and test it on the evaluation server. For MNLI sub-task, we experiment on both the matched (MNLI-m) and mismatched (MNLI-mm) sections. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9' or 'Python 3.8'). |
| Experiment Setup | Yes | This model is stacked with 12 Transformer blocks (Section 2.1) with the following hyper-parameters: number of tokens n = 128, number of self-attention heads h = 12, and hidden layer size d = 768. As for the feed-forward layer, we set the filter size dff to 3072 as in (Devlin et al., 2019). The pre-training is performed for 40 epochs. The trade-off hyperparameter λ in (7) is varied in {10^-1, 10^-2, 10^-3, 10^-4}. (A configuration sketch collecting these values follows below.) |
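
To make the quoted Algorithm 1 (DAM) concrete, here is a short, self-contained PyTorch sketch of a gumbel-sigmoid-relaxed attention mask trained jointly with model parameters. This is not the authors' implementation (their code is at the GitHub link above); the module name, temperature `tau`, the toy loss, and the training loop are illustrative assumptions only.

```python
# Minimal sketch of Algorithm 1 (Differentiable Attention Mask), assuming PyTorch.
# Names such as GumbelSigmoidMask, tau, and the placeholder loss are hypothetical.
import torch
import torch.nn as nn


def gumbel_sigmoid(alpha: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Relaxed Bernoulli sample: sigmoid((alpha + logistic noise) / tau)."""
    u = torch.rand_like(alpha).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)          # logistic noise (difference of Gumbels)
    return torch.sigmoid((alpha + noise) / tau)


class GumbelSigmoidMask(nn.Module):
    """Learnable n x n attention-mask logits alpha, relaxed to a soft mask M in (0, 1)."""
    def __init__(self, n_tokens: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(n_tokens, n_tokens))

    def forward(self, tau: float = 1.0) -> torch.Tensor:
        return gumbel_sigmoid(self.alpha, tau)


# Toy loop mirroring Algorithm 1: update model weights w and mask logits alpha together.
if __name__ == "__main__":
    n = 8
    mask_module = GumbelSigmoidMask(n)
    scores = nn.Parameter(torch.randn(n, n))        # stand-in for attention logits (model params w)
    opt = torch.optim.Adam([scores, mask_module.alpha], lr=1e-2)
    lam = 1e-2                                      # sparsity trade-off, cf. lambda in Eq. (7)
    for _ in range(100):                            # "repeat ... until convergence"
        M = mask_module()
        masked_attn = torch.softmax(scores, dim=-1) * M
        loss = masked_attn.sum() + lam * M.mean()   # placeholder task loss + sparsity regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
    print((mask_module.alpha > 0).float().mean())   # fraction of positions kept after training
```

Because the mask is sampled with a differentiable relaxation, gradients flow into α through the same loss used for the task, which is exactly what lets w and α be updated simultaneously in step 5 of the pseudocode.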
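
For completeness, the reported BERT-base hyper-parameters can be collected into a single configuration object. The paper does not say which framework was used, so the mapping below onto Hugging Face's `BertConfig` (and the surrounding variable names) is an assumption, used only as a convenient container.

```python
# Hypothetical mapping of the reported BERT-base setup onto a config object.
from transformers import BertConfig

config = BertConfig(
    num_hidden_layers=12,        # 12 stacked Transformer blocks
    num_attention_heads=12,      # h = 12 self-attention heads
    hidden_size=768,             # hidden layer size d = 768
    intermediate_size=3072,      # feed-forward filter size d_ff = 3072
)

seq_length = 128                 # number of tokens n = 128
num_pretrain_epochs = 40         # pre-training epochs reported in the paper
lambda_grid = [1e-1, 1e-2, 1e-3, 1e-4]  # trade-off hyper-parameter lambda in Eq. (7)
```

Items the table marks as unspecified (e.g., exact software versions) are deliberately left out of this sketch.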