Rethinking Positional Encoding in Language Pre-training
Authors: Guolin Ke, Di He, Tie-Yan Liu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of the proposed method. |
| Researcher Affiliation | Industry | Guolin Ke, Di He & Tie-Yan Liu; Microsoft Research; {guolin.ke, dihe, tyliu}@microsoft.com |
| Pseudocode | No | The paper describes the proposed method using mathematical equations and text, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes and models are released at https://github.com/guolinke/TUPE. |
| Open Datasets | Yes | Following Devlin et al. (2018), we use the English Wikipedia corpus and Book Corpus (Zhu et al., 2015) for pre-training. ... We use the GLUE (General Language Understanding Evaluation) dataset (Wang et al., 2018) as the downstream tasks to evaluate the performance of the pre-trained models. |
| Dataset Splits | Yes | Each configuration will be run five times with different random seeds, and the median of these five results on the development set will be used as the performance of one configuration. ... Table 1: GLUE scores on dev set. *(See the seed-median sketch after the table.)* |
| Hardware Specification | Yes | All models are run on 16 NVIDIA Tesla V100 GPUs with mixed-precision (Micikevicius et al., 2017). |
| Software Dependencies | No | The paper states 'All codes are implemented based on fairseq (Ott et al., 2019) in PyTorch (Paszke et al., 2017)' and 'We use Adam (Kingma & Ba, 2014) as the optimizer', but it does not specify version numbers for PyTorch, fairseq, or any other libraries. |
| Experiment Setup | Yes | We train the models for 1000k steps where the batch size is 256 and the maximum sequence length is 512. ... We use Adam (Kingma & Ba, 2014) as the optimizer, and set its hyperparameter ϵ to 1e-6 and (β1, β2) to (0.9, 0.999). The peak learning rate is set to 1e-4 with a 10k-step warm-up stage. ... We set the dropout probability to 0.1, gradient clip norm to 1.0, and weight decay to 0.01. The setting details are listed in Table 2. *(A configuration sketch of these settings follows the table.)* |
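
The pre-training settings quoted in the Experiment Setup row translate almost directly into an optimizer configuration. The following is a minimal sketch in plain PyTorch, not the released fairseq setup: the placeholder model, the dummy loss, and the linear decay after warm-up are assumptions (the section only states the warm-up length and peak learning rate), and whether weight decay is applied in decoupled (AdamW-style) form is not specified here. Model-side settings such as dropout 0.1 and maximum sequence length 512 are omitted.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 1_000_000   # "1000k steps"
WARMUP_STEPS = 10_000     # "10k-step warm-up stage"
PEAK_LR = 1e-4            # peak learning rate

# Placeholder module standing in for the actual Transformer encoder (assumption).
model = torch.nn.Linear(512, 512)

optimizer = Adam(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.999),   # (beta1, beta2) as quoted above
    eps=1e-6,             # epsilon as quoted above
    weight_decay=0.01,    # weight decay 0.01; decoupled vs. L2-style is not stated in this section
)

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then linear decay to zero (the decay shape is an assumption)."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(3):  # the actual run trains for TOTAL_STEPS
    # Dummy quadratic loss standing in for the masked-LM objective; batch size 256.
    loss = model(torch.randn(256, 512)).pow(2).mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip norm 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```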
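
Similarly, the Dataset Splits row describes how fine-tuning configurations are scored on GLUE: each configuration is run with five random seeds and the median dev-set result is taken as that configuration's performance. The sketch below illustrates only that selection protocol; `finetune_and_eval` and the candidate grid are hypothetical stand-ins, not part of the released code.

```python
import statistics

SEEDS = [1, 2, 3, 4, 5]                                         # five runs per configuration
CANDIDATE_CONFIGS = [{"lr": 1e-5}, {"lr": 2e-5}, {"lr": 3e-5}]  # hypothetical search grid

def finetune_and_eval(config: dict, seed: int) -> float:
    """Hypothetical stand-in: fine-tune on one GLUE task with this seed and return the dev-set score."""
    return 80.0 + 0.1 * seed  # dummy number for illustration only

def score_config(config: dict) -> float:
    # The median of the five dev-set results is used as the configuration's performance.
    return statistics.median(finetune_and_eval(config, s) for s in SEEDS)

best = max(CANDIDATE_CONFIGS, key=score_config)
print("selected configuration:", best)
```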