Unified Language Model Pre-training for Natural Language Understanding and Generation
Authors: Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted experiments on both NLU (i.e., the GLUE benchmark, and extractive question answering) and NLG tasks (i.e., abstractive summarization, question generation, generative question answering, and dialog response generation). |
| Researcher Affiliation | Industry | Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon; Microsoft Research; {lidong1,nanya,wenwan,fuwei}@microsoft.com; {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com |
| Pseudocode | No | No |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/microsoft/unilm. |
| Open Datasets | Yes | UNILM is initialized by BERTLARGE, and then pre-trained using English Wikipedia and BookCorpus [52], which have been processed in the same way as [9]. We use the non-anonymized version of the CNN/Daily Mail dataset [36] and Gigaword [35] for model fine-tuning and evaluation. We conduct experiments on the Stanford Question Answering Dataset (SQuAD) 2.0 [33], and Conversational Question Answering (CoQA) [34] datasets. We evaluate UNILM on the General Language Understanding Evaluation (GLUE) benchmark [44]. |
| Dataset Splits | Yes | We split the original training set into training and test sets, and keep the original development set. |
| Hardware Specification | Yes | It takes about 7 hours for 10,000 steps using 8 Nvidia Tesla V100 32GB GPU cards with mixed precision training. |
| Software Dependencies | No | No |
| Experiment Setup | Yes | Specifically, we use a 24-layer Transformer with 1,024 hidden size, and 16 attention heads, which contains about 340M parameters. The maximum length of input sequence is 512. The token masking probability is 15%. The learning rate is 3e-5, with linear warmup over the first 40,000 steps and linear decay. The dropout rate is 0.1. The weight decay is 0.01. The batch size is 330. The pre-training procedure runs for about 770,000 steps. (See the configuration sketch below the table.) |
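
The Experiment Setup and Hardware Specification rows quote concrete hyperparameters and the mixed-precision training setup. The sketch below is a minimal, hypothetical PyTorch rendering of those reported values, not code from the authors' repository at https://github.com/microsoft/unilm: the names `PretrainConfig`, `linear_warmup_then_decay`, `build_optimizer`, and `train_step`, as well as the HuggingFace-style `model(**batch).loss` interface, are assumptions introduced for illustration.

```python
from dataclasses import dataclass

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


@dataclass
class PretrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (values from the paper)."""
    num_layers: int = 24          # 24-layer Transformer
    hidden_size: int = 1024       # 1,024 hidden size
    num_heads: int = 16           # 16 attention heads (~340M parameters total)
    max_seq_length: int = 512
    mask_prob: float = 0.15       # token masking probability
    learning_rate: float = 3e-5
    warmup_steps: int = 40_000    # linear warmup, then linear decay
    total_steps: int = 770_000
    dropout: float = 0.1
    weight_decay: float = 0.01
    batch_size: int = 330


def linear_warmup_then_decay(step: int, cfg: PretrainConfig) -> float:
    """LR multiplier: ramp up linearly for warmup_steps, then decay linearly to 0."""
    if step < cfg.warmup_steps:
        return step / max(1, cfg.warmup_steps)
    remaining = cfg.total_steps - step
    return max(0.0, remaining / max(1, cfg.total_steps - cfg.warmup_steps))


def build_optimizer(model: torch.nn.Module, cfg: PretrainConfig):
    """AdamW with the quoted learning rate and weight decay, plus the LR schedule."""
    optimizer = AdamW(model.parameters(), lr=cfg.learning_rate,
                      weight_decay=cfg.weight_decay)
    scheduler = LambdaLR(optimizer, lambda step: linear_warmup_then_decay(step, cfg))
    return optimizer, scheduler


def train_step(model, batch, optimizer, scheduler, scaler):
    """One mixed-precision step (the paper reports 8x Tesla V100 32GB with mixed precision)."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss   # assumes a HuggingFace-style masked-LM interface
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.detach()
```

In this sketch, `scaler` would be a `torch.cuda.amp.GradScaler()` instance created once before training; the exact optimizer settings and loss interface used by the authors may differ from what is shown here.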