Rethinking Positional Encoding in Language Pre-training
Authors: Guolin Ke, Di He, Tie-Yan Liu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of the proposed method. |
| Researcher Affiliation | Industry | Guolin Ke, Di He & Tie-Yan Liu; Microsoft Research; {guolin.ke, dihe, tyliu}@microsoft.com |
| Pseudocode | No | The paper describes the proposed method using mathematical equations and text, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are released at https://github.com/guolinke/TUPE. |
| Open Datasets | Yes | Following Devlin et al. (2018), we use the English Wikipedia corpus and Book Corpus (Zhu et al., 2015) for pre-training. ... We use the GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2018) as the downstream tasks to evaluate the performance of the pre-trained models. |
| Dataset Splits | Yes | Each configuration will be run five times with different random seeds, and the median of these five results on the development set will be used as the performance of one configuration. ... Table 1: GLUE scores on dev set. *(See the seed-median sketch after the table.)* |
| Hardware Specification | Yes | All models are run on 16 NVIDIA Tesla V100 GPUs with mixed-precision (Micikevicius et al., 2017). |
| Software Dependencies | No | The paper states 'All codes are implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017)' and 'We use Adam (Kingma & Ba, 2014) as the optimizer', but it does not specify version numbers for PyTorch, fairseq, or any other libraries. |
| Experiment Setup | Yes | We train the models for 1000k steps where the batch size is 256 and the maximum sequence length is 512. ... We use Adam (Kingma & Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-6 and (β1, β2) to (0.9, 0.999). The peak learning rate is set to 1e-4 with a 10k-step warm-up stage. ... We set the dropout probability to 0.1, gradient clip norm to 1.0, and weight decay to 0.01. The setting details are listed in Table 2. *(A configuration sketch of these settings follows the table.)* |
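
The pre-training settings quoted in the Experiment Setup row translate almost directly into an optimizer configuration. The following is a minimal sketch in plain PyTorch, not the released fairseq setup: the placeholder model, the dummy loss, and the linear decay after warm-up are assumptions (the section only states the warm-up length and peak learning rate), and whether weight decay is applied in decoupled (AdamW-style) form is not specified here. Model-side settings such as dropout 0.1 and maximum sequence length 512 are omitted.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 1_000_000   # "1000k steps"
WARMUP_STEPS = 10_000     # "10k-step warm-up stage"
PEAK_LR = 1e-4            # peak learning rate

# Placeholder module standing in for the actual Transformer encoder (assumption).
model = torch.nn.Linear(512, 512)

optimizer = Adam(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.999),   # (beta1, beta2) as quoted above
    eps=1e-6,             # epsilon as quoted above
    weight_decay=0.01,    # weight decay 0.01; decoupled vs. L2-style is not stated in this section
)

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then linear decay to zero (the decay shape is an assumption)."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(3):  # the actual run trains for TOTAL_STEPS
    # Dummy quadratic loss standing in for the masked-LM objective; batch size 256.
    loss = model(torch.randn(256, 512)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip norm 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```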
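
Similarly, the Dataset Splits row describes how fine-tuning configurations are scored on GLUE: each configuration is run with five random seeds and the median dev-set result is taken as that configuration's performance. The sketch below illustrates only that selection protocol; `finetune_and_eval` and the candidate grid are hypothetical stand-ins, not part of the released code.

```python
import statistics

SEEDS = [1, 2, 3, 4, 5]                                         # five runs per configuration
CANDIDATE_CONFIGS = [{"lr": 1e-5}, {"lr": 2e-5}, {"lr": 3e-5}]  # hypothetical search grid

def finetune_and_eval(config: dict, seed: int) -> float:
    """Hypothetical stand-in: fine-tune on one GLUE task with this seed and return the dev-set score."""
    return 80.0 + 0.1 * seed  # dummy number for illustration only

def score_config(config: dict) -> float:
    # The median of the five dev-set results is used as the configuration's performance.
    return statistics.median(finetune_and_eval(config, s) for s in SEEDS)

best = max(CANDIDATE_CONFIGS, key=score_config)
print("selected configuration:", best)
```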