Revisiting Over-smoothing in BERT from the Perspective of Graph
Authors: Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James Kwok
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiment results on various data sets illustrate the effect of our fusion method. |
| Researcher Affiliation | Collaboration | 1Hong Kong University of Science and Technology, 2The University of Hong Kong, 3Huawei Noah's Ark Lab, 4Sun Yat-sen University |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | Our implementation is based on the Hugging Face's Transformers library (Wolf et al., 2020). |
| Open Datasets | Yes | GLUE (Wang et al., 2018a), SWAG (Zellers et al., 2018) and SQuAD (Rajpurkar et al., 2016; 2018) data sets. |
| Dataset Splits | Yes | we take the development set data of STS-B (Cer et al., 2017), CoLA (Warstadt et al., 2019), SQuAD (Rajpurkar et al., 2016) as input to the fine-tuned models |
| Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | Our implementation is based on the Hugging Face's Transformers library (Wolf et al., 2020). |
| Experiment Setup | Yes | The BERT model is stacked with 12 Transformer blocks (Section 2.1) with the following hyperparameters: number of tokens n = 128, number of self-attention heads h = 12, and hidden layer size d = 768. As for the feed-forward layer, we set the filter size dff to 3072 as in Devlin et al. (2019). ... The hyper-parameters of various downstream tasks are shown in Table 4. |
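
The quoted setup corresponds to a standard BERT-base configuration. Below is a minimal sketch of how that configuration and the referenced development-set data could be instantiated with the Hugging Face Transformers and Datasets libraries (the libraries the paper states its implementation builds on); the checkpoint name "bert-base-uncased", the GLUE subset identifier "stsb", and the small batch slice are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the reported BERT-base hyperparameters, assuming Hugging Face
# Transformers/Datasets (not the authors' released code).
from transformers import BertConfig, BertModel, BertTokenizerFast
from datasets import load_dataset

# Quoted in the "Experiment Setup" row: 12 Transformer blocks,
# hidden size d = 768, h = 12 attention heads, feed-forward size d_ff = 3072.
config = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = BertModel(config)  # randomly initialized 12-layer encoder

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Development-set data of STS-B, as referenced in the "Dataset Splits" row
# ("stsb" is the Hugging Face Datasets identifier for the GLUE STS-B subset).
stsb_dev = load_dataset("glue", "stsb", split="validation")

# n = 128 is the maximum sequence length; a small batch is used here for illustration.
encoded = tokenizer(
    stsb_dev["sentence1"][:8],
    stsb_dev["sentence2"][:8],
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
outputs = model(**encoded)  # hidden states of shape (8, 128, 768)
```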