Revisiting Over-smoothing in BERT from the Perspective of Graph

Authors: Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James Kwok

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiment results on various data sets illustrate the effect of our fusion method.
Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, 2 The University of Hong Kong, 3 Huawei Noah's Ark Lab, 4 Sun Yat-sen University
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | Our implementation is based on the Hugging Face's Transformers library (Wolf et al., 2020).
Open Datasets | Yes | GLUE (Wang et al., 2018a), SWAG (Zellers et al., 2018) and SQuAD (Rajpurkar et al., 2016; 2018) data sets.
Dataset Splits | Yes | we take the development set data of STS-B (Cer et al., 2017), CoLA (Warstadt et al., 2019), SQuAD (Rajpurkar et al., 2016) as input to the fine-tuned models (see the loading sketch after the table)
Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 GPUs.
Software Dependencies | No | Our implementation is based on the Hugging Face's Transformers library (Wolf et al., 2020).
Experiment Setup | Yes | The BERT model is stacked with 12 Transformer blocks (Section 2.1) with the following hyperparameters: number of tokens n = 128, number of self-attention heads h = 12, and hidden layer size d = 768. As for the feed-forward layer, we set the filter size d_ff to 3072 as in Devlin et al. (2019). ... The hyper-parameters of various downstream tasks are shown in Table 4. (see the configuration sketch after the table)
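The development splits named in the Dataset Splits row are all publicly downloadable. Below is a minimal loading sketch assuming the Hugging Face `datasets` package; the paper does not state which data pipeline was actually used, and the hub identifiers ("glue"/"stsb", "glue"/"cola", "squad") are standard names rather than something quoted from the paper.

```python
# Sketch only: fetch the development splits mentioned above from the
# Hugging Face datasets hub. The paper does not specify this tooling.
from datasets import load_dataset

stsb_dev = load_dataset("glue", "stsb", split="validation")   # STS-B dev set
cola_dev = load_dataset("glue", "cola", split="validation")   # CoLA dev set
squad_dev = load_dataset("squad", split="validation")         # SQuAD v1.1 dev set

print(len(stsb_dev), len(cola_dev), len(squad_dev))
```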
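The Experiment Setup hyper-parameters correspond to a standard BERT-base configuration. The sketch below maps them onto Hugging Face's `transformers`, which the paper states its implementation is based on; the `bert-base-uncased` checkpoint name is an assumption, and the 128-token sequence length is applied at tokenization time rather than inside the model configuration.

```python
# Sketch of the reported BERT-base hyper-parameters in Hugging Face terms.
from transformers import BertConfig, BertModel, BertTokenizerFast

config = BertConfig(
    num_hidden_layers=12,     # 12 stacked Transformer blocks
    num_attention_heads=12,   # h = 12 self-attention heads
    hidden_size=768,          # hidden layer size d = 768
    intermediate_size=3072,   # feed-forward filter size d_ff = 3072
)
model = BertModel(config)

# The number of tokens n = 128 is enforced when tokenizing the inputs.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["An example sentence."], padding="max_length",
                  truncation=True, max_length=128, return_tensors="pt")
outputs = model(**batch)  # outputs.last_hidden_state has shape (1, 128, 768)
```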