Unified Language Model Pre-training for Natural Language Understanding and Generation

Authors: Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have conducted experiments on both NLU (i.e., the GLUE benchmark and extractive question answering) and NLG tasks (i.e., abstractive summarization, question generation, generative question answering, and dialog response generation).
Researcher Affiliation | Industry | Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon; Microsoft Research; {lidong1,nanya,wenwan,fuwei}@microsoft.com, {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com
Pseudocode | No | No
Open Source Code | Yes | The code and pre-trained models are available at https://github.com/microsoft/unilm.
Open Datasets | Yes | UNILM is initialized by BERT-large, and then pre-trained using English Wikipedia and BookCorpus [52], which have been processed in the same way as [9]. We use the non-anonymized version of the CNN/Daily Mail dataset [36] and Gigaword [35] for model fine-tuning and evaluation. We conduct experiments on the Stanford Question Answering Dataset (SQuAD) 2.0 [33] and Conversational Question Answering (CoQA) [34] datasets. We evaluate UNILM on the General Language Understanding Evaluation (GLUE) benchmark [44].
Dataset Splits | Yes | We split the original training set into training and test sets, and keep the original development set.
Hardware Specification | Yes | It takes about 7 hours for 10,000 steps using 8 NVIDIA Tesla V100 32GB GPU cards with mixed precision training.
Software Dependencies | No | No
Experiment Setup | Yes | Specifically, we use a 24-layer Transformer with 1,024 hidden size and 16 attention heads, which contains about 340M parameters. The maximum length of input sequence is 512. The token masking probability is 15%. The learning rate is 3e-5, with linear warmup over the first 40,000 steps and linear decay. The dropout rate is 0.1. The weight decay is 0.01. The batch size is 330. The pre-training procedure runs for about 770,000 steps.
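
To make the quoted setup concrete, the sketch below wires the reported hyperparameters into a small PyTorch training skeleton. It is a minimal illustration under stated assumptions, not the authors' released code (that lives at https://github.com/microsoft/unilm): the generic TransformerEncoder, the dummy batch, the placeholder loss, and the AdamW/torch.cuda.amp choices are stand-ins for the actual UniLM masked-LM objective and training stack; only the numeric settings come from the table above.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Reported pre-training settings (quoted above); the constant names are illustrative.
PEAK_LR = 3e-5          # learning rate
WARMUP_STEPS = 40_000   # linear warmup over the first 40,000 steps
TOTAL_STEPS = 770_000   # pre-training runs for about 770,000 steps
WEIGHT_DECAY = 0.01
DROPOUT = 0.1
BATCH_SIZE = 330
MAX_SEQ_LEN = 512
MASK_PROB = 0.15        # token masking probability

def linear_warmup_then_decay(step: int) -> float:
    """Scale factor on PEAK_LR: linear warmup to the peak, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

# Placeholder for the 24-layer, 1,024-hidden, 16-head Transformer (~340M parameters);
# the 4,096 feed-forward size and GELU activation follow the BERT-large architecture
# the model is initialized from.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(
        d_model=1024, nhead=16, dim_feedforward=4096,
        dropout=DROPOUT, activation="gelu", batch_first=True,
    ),
    num_layers=24,
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# One illustrative mixed-precision update (the paper reports mixed precision on 8x V100 32GB).
use_amp = device.type == "cuda"
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

optimizer.zero_grad()
dummy_batch = torch.randn(2, 8, 1024, device=device)   # placeholder inputs, not real masked-LM data
with torch.cuda.amp.autocast(enabled=use_amp):
    hidden = model(dummy_batch)
    loss = hidden.pow(2).mean()                         # placeholder loss; the real objective is masked-LM cross-entropy
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
scheduler.step()
```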