MPNet: Masked and Permuted Pre-training for Language Understanding
Authors: Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting. |
| Researcher Affiliation | Collaboration | Kaitao Song¹, Xu Tan², Tao Qin², Jianfeng Lu¹, Tie-Yan Liu²; ¹Nanjing University of Science and Technology, ²Microsoft Research; {kt.song,lujf}@njust.edu.cn, {xuta,taoqin,tyliu}@microsoft.com |
| Pseudocode | No | The paper describes the proposed method using text and figures (Figure 2, 3), but it does not include a block labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code and the pre-trained models are available at: https://github.com/microsoft/MPNet. (A hedged loading sketch appears after this table.) |
| Open Datasets | Yes | We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). ... For pre-training corpus, we follow the data used in RoBERTa [7], which includes Wikipedia and BooksCorpus [15], OpenWebText [16], CC-News [17] and Stories [18], with 160GB data size in total. |
| Dataset Splits | Yes | We fine-tune on a variety of down-streaming benchmark tasks, including GLUE, SQuAD, RACE and IMDB. ... On the dev set of GLUE tasks, MPNet outperforms BERT [2], XLNet [5] and RoBERTa [7] by 4.8, 3.4, 1.5 points on average. ... The Stanford Question Answering Dataset (SQuAD) task... We evaluate our model on SQuAD v1.1 [26] dev set and SQuAD v2.0 [29] dev/test set... |
| Hardware Specification | Yes | We use 32 NVIDIA Tesla 32GB V100 GPUs, with FP16 for speedup. |
| Software Dependencies | No | The paper mentions 'Adam [19] with β1 = 0.9, β2 = 0.98 and ϵ = 1e-6, and weight decay is set as 0.01' as an optimizer, but it does not specify software dependencies like programming languages (e.g., Python version) or specific library versions (e.g., PyTorch version). |
| Experiment Setup | Yes | We conduct experiments under the BERT base setting (BERTBASE) [2], where the model consists of 12 transformer layers, with 768 hidden size, 12 attention heads, and 110M model parameters in total. For the pre-training objective of MPNet, we randomly permute the sentence following PLM [5], choose the rightmost 15% tokens as the predicted tokens, and prepare mask tokens following the same 8:1:1 replacement strategy in BERT [2]. Additionally, we also apply whole word mask [12] and relative positional embedding [13] in our model pre-training... We use Adam [19] with β1 = 0.9, β2 = 0.98 and ϵ = 1e-6, and weight decay is set as 0.01. We pre-train our model for 500K steps... We use a sub-word dictionary with 30K BPE codes... we limit the length of sentences in each batch as up to 512 tokens and use a batch size of 8192 sentences. (Minimal sketches of the predicted-token selection and the optimizer configuration follow this table.) |
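
To make the quoted pre-training objective more concrete, here is a minimal, hypothetical Python sketch of the predicted-token selection described in the Experiment Setup row: randomly permute the sequence, take the rightmost 15% of the permuted order as predicted tokens, and corrupt them with the 8:1:1 ([MASK] / random token / keep) replacement strategy from BERT. `MASK_ID` and `VOCAB_SIZE` are placeholders, and whole word masking and relative positional embeddings are omitted; this is a sketch, not the authors' implementation (see https://github.com/microsoft/MPNet for that).

```python
import random

MASK_ID = 103        # placeholder [MASK] token id
VOCAB_SIZE = 30000   # the paper uses a 30K BPE vocabulary

def prepare_mpnet_inputs(token_ids, pred_ratio=0.15, seed=None):
    """Sketch of MPNet-style input preparation: permute the positions,
    treat the rightmost 15% of the permuted order as predicted tokens,
    and apply the 8:1:1 mask/random/keep corruption to those positions."""
    rng = random.Random(seed)
    n = len(token_ids)
    order = list(range(n))
    rng.shuffle(order)                      # random permutation of positions
    num_pred = max(1, int(n * pred_ratio))  # rightmost 15% are predicted
    non_pred, pred = order[:-num_pred], order[-num_pred:]

    corrupted = list(token_ids)
    for pos in pred:
        p = rng.random()
        if p < 0.8:                         # 80%: replace with [MASK]
            corrupted[pos] = MASK_ID
        elif p < 0.9:                       # 10%: replace with a random token
            corrupted[pos] = rng.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token
    return corrupted, non_pred, pred
```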
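The optimizer settings quoted in the Software Dependencies and Experiment Setup rows can be written out as follows. The paper does not state the training framework or the learning-rate schedule, so this sketch assumes PyTorch and decoupled weight decay (AdamW) with a placeholder learning rate; only the β, ϵ, and weight-decay values come from the quoted text.

```python
import torch
from torch import nn

# Stand-in module; the paper's model is a 12-layer transformer with 768 hidden size.
model = nn.Linear(768, 768)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # placeholder; the learning rate is not given in the excerpt
    betas=(0.9, 0.98),  # β1, β2 as reported in the paper
    eps=1e-6,
    weight_decay=0.01,
)
```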
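As a usage note for the Open Source Code row: the official release lives at https://github.com/microsoft/MPNet. The sketch below instead assumes the Hugging Face `transformers` port of MPNet and the `microsoft/mpnet-base` checkpoint on the model hub; both the library and the checkpoint name are assumptions, not stated in the paper.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name on the Hugging Face hub, not taken from the paper.
tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet unifies masked and permuted pre-training.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768), matching the 768 hidden size
```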