On Position Embeddings in BERT

Authors: Benyou Wang, Lifeng Shang, Christina Lioma, Xin Jiang, Hao Yang, Qun Liu, Jakob Grue Simonsen

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An empirical evaluation of seven PEs (and their combinations) for classification (GLUE) and span prediction (SQuAD) shows that: (1) both classification and span prediction benefit from translation invariance and local monotonicity, while symmetry slightly decreases performance; (2) the fully-learnable absolute PE performs better in classification, while relative PEs perform better in span prediction. We contribute the first formal and quantitative analysis of desiderata for PEs, and a principled discussion about their correlation to the performance of typical downstream tasks. (A sketch of these three properties follows the table.)
Researcher Affiliation | Collaboration | Benyou Wang (University of Padova, wang@dei.unipd.it); Lifeng Shang (Huawei Noah's Ark Lab, Shang.Lifeng@huawei.com); Christina Lioma (University of Copenhagen, c.lioma@di.ku.dk); Xin Jiang (Huawei Noah's Ark Lab, Jiang.Xin@huawei.com); Hao Yang (Huawei Technologies Co., Ltd., yanghao30@huawei.com); Qun Liu (Huawei Noah's Ark Lab, qun.liu@huawei.com); Jakob Grue Simonsen (University of Copenhagen, simonsen@di.ku.dk)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for its methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | For classification, we use the GLUE (Wang et al., 2018) benchmark, which includes datasets for both single document classification and sentence pair classification. For span prediction, we use the SQuAD V1.1 and V2.0 datasets consisting of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). (A data-loading sketch follows the table.)
Dataset Splits | Yes | We benchmark 13 PEs (including APEs, RPEs, and their combinations) in GLUE and SQuAD, in a total of 11 individual tasks. For classification, we use the GLUE (Wang et al., 2018) benchmark... For span prediction, we use the SQuAD V1.1 and V2.0 datasets... We report the average values of five runs per dataset. Table 4: Performance (average and standard deviation in 5 runs) on dev of SQuAD V1.1 and V2.0.
Hardware Specification | No | The paper does not explicitly state the specific hardware used (e.g., GPU models, CPU models, memory specifications) for running its experiments.
Software Dependencies | No | The paper mentions fine-tuning with the HuggingFace Transformers library (Wolf et al., 2019), but it provides no version numbers for any software dependency (e.g., Python, PyTorch/TensorFlow, or the Transformers library itself). (A fine-tuning sketch follows the table.)
Experiment Setup | Yes | We train the new models with a sequence length of 128 for 5 epochs and then 512 for another 2 epochs. The training data is the same as in the original BERT, i.e., BooksCorpus and Wikipedia (16G raw documents) with whole word masking. All models have about 110M parameters, corresponding to a typical base setting, with minor differences solely depending on the parameterization in Tab. 1. Table 6: Detailed Experimental Settings. (A configuration sketch follows the table.)
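The sketches below are illustrative only; none of the code comes from the paper. First, the three PE properties named in the Research Type row can be made concrete on the classical sinusoidal absolute PE. This is a minimal NumPy sketch, assuming the sinusoidal embedding of Vaswani et al. (2017), that probes translation invariance, symmetry, and local monotonicity through position-to-position dot products:

```python
import numpy as np

def sinusoidal_pe(max_len: int, d: int) -> np.ndarray:
    """Sinusoidal absolute position embeddings (Vaswani et al., 2017)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    idx = np.arange(0, d, 2)[None, :]          # (1, d/2), one frequency per pair
    angles = pos / np.power(10000.0, idx / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=512, d=128)
sim = pe @ pe.T  # dot product between every pair of positions

# Translation invariance: sim[x, y] depends only on the offset y - x.
assert np.isclose(sim[0, 5], sim[100, 105])

# Symmetry: sim[x, y] == sim[y, x]; per the paper, this property
# slightly *decreases* downstream performance.
assert np.allclose(sim, sim.T)

# Local monotonicity: similarity decays as nearby offsets grow.
assert sim[100, 100] > sim[100, 101] > sim[100, 103] > sim[100, 108]
```

Relative PEs enforce translation invariance by construction, whereas a fully-learnable absolute PE is free to learn or violate any of these properties.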
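For the Open Datasets row: the paper does not say how the data was obtained, but both benchmarks are public. A hypothetical sketch using the HuggingFace `datasets` library (not mentioned in the paper):

```python
from datasets import load_dataset  # pip install datasets

# GLUE is a suite of tasks; MRPC is shown here as one example config.
glue_mrpc = load_dataset("glue", "mrpc")   # splits: train / validation / test
squad_v1 = load_dataset("squad")           # SQuAD V1.1: train / validation
squad_v2 = load_dataset("squad_v2")        # SQuAD V2.0: train / validation

print({name: len(split) for name, split in squad_v1.items()})
```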
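For the Software Dependencies row: the paper cites Wolf et al. (2019) for fine-tuning but pins no versions. A minimal sketch of one fine-tuning step with a recent Transformers release; the checkpoint name and hyperparameters here are illustrative, not the paper's:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

batch = tokenizer(["first example", "second example"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
labels = torch.tensor([0, 1])

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()                  # gradients for one step; optimizer omitted
```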
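For the Experiment Setup row: the quoted two-phase pre-training schedule, summarized as a configuration sketch. Only the sequence lengths, epoch counts, corpora, masking scheme, and parameter budget come from the paper; remaining hyperparameters (batch size, learning rate, etc.) are deferred to the paper's Table 6 and deliberately omitted:

```python
# Two-phase pre-training schedule as quoted in the Experiment Setup row.
PRETRAIN_PHASES = [
    {"seq_length": 128, "epochs": 5},  # phase 1
    {"seq_length": 512, "epochs": 2},  # phase 2
]
DATA = {
    "corpora": ["BooksCorpus", "Wikipedia"],  # ~16G raw documents
    "masking": "whole word masking",
}
MODEL = {"parameters": "~110M (BERT-base scale)"}
```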