MPNet: Masked and Permuted Pre-training for Language Understanding

Authors: Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.
Researcher Affiliation | Collaboration | Kaitao Song¹, Xu Tan², Tao Qin², Jianfeng Lu¹, Tie-Yan Liu²; ¹Nanjing University of Science and Technology, ²Microsoft Research; {kt.song,lujf}@njust.edu.cn, {xuta,taoqin,tyliu}@microsoft.com
Pseudocode | No | The paper describes the proposed method using text and figures (Figures 2 and 3), but it does not include a block labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | The code and the pre-trained models are available at: https://github.com/microsoft/MPNet.
Open Datasets | Yes | We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). ... For pre-training corpus, we follow the data used in RoBERTa [7], which includes Wikipedia and BooksCorpus [15], OpenWebText [16], CC-News [17] and Stories [18], with 160GB data size in total.
Dataset Splits | Yes | We fine-tune on a variety of down-streaming benchmark tasks, including GLUE, SQuAD, RACE and IMDB. ... On the dev set of GLUE tasks, MPNet outperforms BERT [2], XLNet [5] and RoBERTa [7] by 4.8, 3.4, 1.5 points on average. ... The Stanford Question Answering Dataset (SQuAD) task... We evaluate our model on SQuAD v1.1 [26] dev set and SQuAD v2.0 [29] dev/test set...
Hardware Specification | Yes | We use 32 NVIDIA Tesla 32GB V100 GPUs, with FP16 for speedup.
Software Dependencies | No | The paper mentions 'Adam [19] with β1 = 0.9, β2 = 0.98 and ϵ = 1e-6, and weight decay is set as 0.01' as an optimizer, but it does not specify software dependencies such as programming languages (e.g., Python version) or specific library versions (e.g., PyTorch version).
Experiment Setup | Yes | We conduct experiments under the BERT base setting (BERTBASE) [2], where the model consists of 12 transformer layers, with 768 hidden size, 12 attention heads, and 110M model parameters in total. For the pre-training objective of MPNet, we randomly permute the sentence following PLM [5], choose the rightmost 15% tokens as the predicted tokens, and prepare mask tokens following the same 8:1:1 replacement strategy in BERT [2]. Additionally, we also apply whole word mask [12] and relative positional embedding [13] in our model pre-training... We use Adam [19] with β1 = 0.9, β2 = 0.98 and ϵ = 1e-6, and weight decay is set as 0.01. We pre-train our model for 500K steps... We use a sub-word dictionary with 30K BPE codes... we limit the length of sentences in each batch as up to 512 tokens and use a batch size of 8192 sentences.
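To complement the Open Source Code entry above, the sketch below shows one way to load the released pre-trained encoder for feature extraction. It assumes the checkpoint is also published on the Hugging Face Hub under the identifier microsoft/mpnet-base and that the torch and transformers packages are installed; the official repository ships a fairseq-based training pipeline, so treat this as an illustrative loading path rather than the authors' setup.

```python
# Minimal loading sketch (assumption: the released weights are mirrored on the
# Hugging Face Hub as "microsoft/mpnet-base"); not taken from the paper or repo.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet unifies masked and permuted pre-training.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The base setting quoted above uses a 768-dimensional hidden size, so the
# encoder output should have shape (batch, sequence_length, 768).
print(outputs.last_hidden_state.shape)
```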
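The Experiment Setup entry describes the masked-and-permuted input construction: randomly permute the sentence, take the rightmost 15% of the permuted tokens as the predicted tokens, and apply BERT's 8:1:1 replacement strategy (mask / random token / keep). The following sketch illustrates that data-preparation step on raw token ids under stated assumptions; it is not the authors' implementation, it omits whole word masking and relative positional embeddings, and MASK_ID and VOCAB_SIZE are placeholders (the paper quotes a 30K-BPE sub-word vocabulary).

```python
# Illustrative sketch (not the authors' implementation) of the quoted recipe:
# permute the token positions, treat the rightmost 15% of the permuted order as
# predicted tokens, and apply the 8:1:1 replacement (80% [MASK], 10% random
# token, 10% unchanged). MASK_ID and VOCAB_SIZE are placeholders.
import random

MASK_ID = 0          # placeholder id for the [MASK] symbol
VOCAB_SIZE = 30000   # matches the quoted 30K BPE codes, but is only a placeholder

def prepare_mpnet_inputs(token_ids, predict_ratio=0.15, seed=None):
    rng = random.Random(seed)
    n = len(token_ids)
    perm = list(range(n))
    rng.shuffle(perm)                          # random permutation of positions
    num_pred = max(1, round(n * predict_ratio))
    pred_positions = perm[-num_pred:]          # rightmost 15% of the permuted order
    inputs = list(token_ids)
    targets = {}                               # position -> original token to predict
    for pos in pred_positions:
        targets[pos] = token_ids[pos]
        r = rng.random()
        if r < 0.8:                            # 80%: replace with [MASK]
            inputs[pos] = MASK_ID
        elif r < 0.9:                          # 10%: replace with a random token
            inputs[pos] = rng.randrange(VOCAB_SIZE)
        # remaining 10%: keep the original token unchanged
    return perm, inputs, targets

# Toy usage on a short sequence of token ids:
perm, masked_inputs, targets = prepare_mpnet_inputs(
    [11, 42, 7, 99, 5, 23, 64, 8, 31, 17], seed=0
)
print(perm, masked_inputs, targets)
```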
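Finally, the quoted optimizer hyper-parameters (β1 = 0.9, β2 = 0.98, ϵ = 1e-6, weight decay 0.01) map directly onto a standard Adam configuration. The sketch below expresses them in PyTorch purely for illustration; the learning rate and the stand-in encoder module are placeholders, since neither the framework nor the learning-rate schedule appears in the quoted text.

```python
# Illustrative only: the quoted Adam hyper-parameters expressed as a PyTorch
# optimizer. `encoder` is a stand-in for the 12-layer, 768-hidden transformer.
import torch

encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=12), num_layers=12
)
optimizer = torch.optim.Adam(
    encoder.parameters(),
    lr=1e-4,            # placeholder; the learning-rate schedule is not quoted above
    betas=(0.9, 0.98),  # β1, β2 as reported
    eps=1e-6,           # ϵ as reported
    weight_decay=0.01,  # weight decay as reported
)
```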