Revisiting Over-smoothing in BERT from the Perspective of Graph

Authors: Han Shi, Jiahui Gao, Hang Xu, Xiaodan Liang, Zhenguo Li, Lingpeng Kong, Stephen M. S. Lee, James Kwok

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiment results on various data sets illustrate the effect of our fusion method.
Researcher Affiliation | Collaboration | 1 Hong Kong University of Science and Technology, 2 The University of Hong Kong, 3 Huawei Noah's Ark Lab, 4 Sun Yat-sen University
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | Our implementation is based on the Hugging Face's Transformers library (Wolf et al., 2020).
Open Datasets | Yes | GLUE (Wang et al., 2018a), SWAG (Zellers et al., 2018) and SQuAD (Rajpurkar et al., 2016; 2018) data sets.
Dataset Splits | Yes | we take the development set data of STS-B (Cer et al., 2017), CoLA (Warstadt et al., 2019), SQuAD (Rajpurkar et al., 2016) as input to the fine-tuned models (see the loading sketch after the table)
Hardware Specification | Yes | All experiments are performed on NVIDIA Tesla V100 GPUs.
Software Dependencies | No | Our implementation is based on the Hugging Face's Transformers library (Wolf et al., 2020).
Experiment Setup | Yes | The BERT model is stacked with 12 Transformer blocks (Section 2.1) with the following hyperparameters: number of tokens n = 128, number of self-attention heads h = 12, and hidden layer size d = 768. As for the feed-forward layer, we set the filter size d_ff to 3072 as in Devlin et al. (2019). ... The hyper-parameters of various downstream tasks are shown in Table 4. (see the configuration sketch after the table)
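The development splits named in the Dataset Splits row are all publicly downloadable. Below is a minimal loading sketch assuming the Hugging Face `datasets` package; the paper does not state which data pipeline was actually used, and the hub identifiers ("glue"/"stsb", "glue"/"cola", "squad") are standard names rather than something quoted from the paper.

```python
# Sketch only: fetch the development splits mentioned above from the
# Hugging Face datasets hub. The paper does not specify this tooling.
from datasets import load_dataset

stsb_dev = load_dataset("glue", "stsb", split="validation")   # STS-B dev set
cola_dev = load_dataset("glue", "cola", split="validation")   # CoLA dev set
squad_dev = load_dataset("squad", split="validation")         # SQuAD v1.1 dev set

print(len(stsb_dev), len(cola_dev), len(squad_dev))
```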
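The Experiment Setup hyper-parameters correspond to a standard BERT-base configuration. The sketch below maps them onto Hugging Face's `transformers`, which the paper states its implementation is based on; the `bert-base-uncased` checkpoint name is an assumption, and the 128-token sequence length is applied at tokenization time rather than inside the model configuration.

```python
# Sketch of the reported BERT-base hyper-parameters in Hugging Face terms.
from transformers import BertConfig, BertModel, BertTokenizerFast

config = BertConfig(
    num_hidden_layers=12,     # 12 stacked Transformer blocks
    num_attention_heads=12,   # h = 12 self-attention heads
    hidden_size=768,          # hidden layer size d = 768
    intermediate_size=3072,   # feed-forward filter size d_ff = 3072
)
model = BertModel(config)

# The number of tokens n = 128 is enforced when tokenizing the inputs.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["An example sentence."], padding="max_length",
                  truncation=True, max_length=128, return_tensors="pt")
outputs = model(**batch)  # outputs.last_hidden_state has shape (1, 128, 768)
```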