On Position Embeddings in BERT

Authors: Benyou Wang, Lifeng Shang, Christina Lioma, Xin Jiang, Hao Yang, Qun Liu, Jakob Grue Simonsen

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An empirical evaluation of seven PEs (and their combinations) for classification (GLUE) and span prediction (SQuAD) shows that: (1) both classification and span prediction benefit from translation invariance and local monotonicity, while symmetry slightly decreases performance; (2) the fully-learnable absolute PE performs better in classification, while relative PEs perform better in span prediction. We contribute the first formal and quantitative analysis of desiderata for PEs, and a principled discussion about their correlation to the performance of typical downstream tasks. (A sketch of these three properties follows the table.)
Researcher Affiliation | Collaboration | Benyou Wang (University of Padova, wang@dei.unipd.it); Lifeng Shang (Huawei Noah's Ark Lab, Shang.Lifeng@huawei.com); Christina Lioma (University of Copenhagen, c.lioma@di.ku.dk); Xin Jiang (Huawei Noah's Ark Lab, Jiang.Xin@huawei.com); Hao Yang (Huawei Technologies Co., Ltd., yanghao30@huawei.com); Qun Liu (Huawei Noah's Ark Lab, qun.liu@huawei.com); Jakob Grue Simonsen (University of Copenhagen, simonsen@di.ku.dk)
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for its methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | For classification, we use the GLUE (Wang et al., 2018) benchmark, which includes datasets for both single document classification and sentence pair classification. For span prediction, we use the SQuAD V1.1 and V2.0 datasets consisting of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). (A data-loading sketch follows the table.)
Dataset Splits | Yes | We benchmark 13 PEs (including APEs, RPEs, and their combinations) in GLUE and SQuAD, in a total of 11 individual tasks. For classification, we use the GLUE (Wang et al., 2018) benchmark... For span prediction, we use the SQuAD V1.1 and V2.0 datasets... We report the average values of five runs per dataset. Table 4: Performance (average and standard deviation in 5 runs) on dev of SQuAD V1.1 and V2.0.
Hardware Specification | No | The paper does not explicitly state the specific hardware used (e.g., GPU models, CPU models, memory specifications) for running its experiments.
Software Dependencies | No | The paper mentions fine-tuning with the HuggingFace Transformers library (Wolf et al., 2019), but it provides no version numbers for any software dependency (e.g., Python, PyTorch/TensorFlow, or the Transformers library itself). (A fine-tuning sketch follows the table.)
Experiment Setup | Yes | We train the new models with a sequence length of 128 for 5 epochs and then 512 for another 2 epochs. The training data is the same as in the original BERT, i.e., BooksCorpus and Wikipedia (16G raw documents) with whole word masking. All models have about 110M parameters, corresponding to a typical base setting, with minor differences solely depending on the parameterization in Tab. 1. Table 6: Detailed Experimental Settings. (A configuration sketch follows the table.)
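The sketches below are illustrative only; none of the code comes from the paper. First, the three PE properties named in the Research Type row can be made concrete on the classical sinusoidal absolute PE. This is a minimal NumPy sketch, assuming the sinusoidal embedding of Vaswani et al. (2017), that probes translation invariance, symmetry, and local monotonicity through position-to-position dot products:

```python
import numpy as np

def sinusoidal_pe(max_len: int, d: int) -> np.ndarray:
    """Sinusoidal absolute position embeddings (Vaswani et al., 2017)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    idx = np.arange(0, d, 2)[None, :]          # (1, d/2), one frequency per pair
    angles = pos / np.power(10000.0, idx / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=512, d=128)
sim = pe @ pe.T  # dot product between every pair of positions

# Translation invariance: sim[x, y] depends only on the offset y - x.
assert np.isclose(sim[0, 5], sim[100, 105])

# Symmetry: sim[x, y] == sim[y, x]; per the paper, this property
# slightly *decreases* downstream performance.
assert np.allclose(sim, sim.T)

# Local monotonicity: similarity decays as nearby offsets grow.
assert sim[100, 100] > sim[100, 101] > sim[100, 103] > sim[100, 108]
```

Relative PEs enforce translation invariance by construction, whereas a fully-learnable absolute PE is free to learn or violate any of these properties.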
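For the Open Datasets row: the paper does not say how the data was obtained, but both benchmarks are public. A hypothetical sketch using the HuggingFace `datasets` library (not mentioned in the paper):

```python
from datasets import load_dataset  # pip install datasets

# GLUE is a suite of tasks; MRPC is shown here as one example config.
glue_mrpc = load_dataset("glue", "mrpc")   # splits: train / validation / test
squad_v1 = load_dataset("squad")           # SQuAD V1.1: train / validation
squad_v2 = load_dataset("squad_v2")        # SQuAD V2.0: train / validation

print({name: len(split) for name, split in squad_v1.items()})
```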
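For the Software Dependencies row: the paper cites Wolf et al. (2019) for fine-tuning but pins no versions. A minimal sketch of one fine-tuning step with a recent Transformers release; the checkpoint name and hyperparameters here are illustrative, not the paper's:

```python
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

batch = tokenizer(["first example", "second example"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
labels = torch.tensor([0, 1])

outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()                  # gradients for one step; optimizer omitted
```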
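For the Experiment Setup row: the quoted two-phase pre-training schedule, summarized as a configuration sketch. Only the sequence lengths, epoch counts, corpora, masking scheme, and parameter budget come from the paper; remaining hyperparameters (batch size, learning rate, etc.) are deferred to the paper's Table 6 and deliberately omitted:

```python
# Two-phase pre-training schedule as quoted in the Experiment Setup row.
PRETRAIN_PHASES = [
    {"seq_length": 128, "epochs": 5},  # phase 1
    {"seq_length": 512, "epochs": 2},  # phase 2
]
DATA = {
    "corpora": ["BooksCorpus", "Wikipedia"],  # ~16G raw documents
    "masking": "whole word masking",
}
MODEL = {"parameters": "~110M (BERT-base scale)"}
```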