Unified Language Model Pre-training for Natural Language Understanding and Generation
Authors: Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted experiments on both NLU (i.e., the GLUE benchmark, and extractive question answering) and NLG tasks (i.e., abstractive summarization, question generation, generative question answering, and dialog response generation). |
| Researcher Affiliation | Industry | Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon; Microsoft Research; {lidong1,nanya,wenwan,fuwei}@microsoft.com; {xiaodl,yuwan,jfgao,mingzhou,hon}@microsoft.com |
| Pseudocode | No | No |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/microsoft/unilm. |
| Open Datasets | Yes | UNILM is initialized by BERTLARGE, and then pre-trained using English Wikipedia and BookCorpus [52], which have been processed in the same way as [9]. We use the non-anonymized version of the CNN/Daily Mail dataset [36] and Gigaword [35] for model fine-tuning and evaluation. We conduct experiments on the Stanford Question Answering Dataset (SQuAD) 2.0 [33], and Conversational Question Answering (CoQA) [34] datasets. We evaluate UNILM on the General Language Understanding Evaluation (GLUE) benchmark [44]. |
| Dataset Splits | Yes | We split the original training set into training and test sets, and keep the original development set. |
| Hardware Specification | Yes | It takes about 7 hours for 10,000 steps using 8 Nvidia Tesla V100 32GB GPU cards with mixed precision training. |
| Software Dependencies | No | No |
| Experiment Setup | Yes | Specifically, we use a 24-layer Transformer with 1,024 hidden size, and 16 attention heads, which contains about 340M parameters. The maximum length of input sequence is 512. The token masking probability is 15%. The learning rate is 3e-5, with linear warmup over the first 40,000 steps and linear decay. The dropout rate is 0.1. The weight decay is 0.01. The batch size is 330. The pre-training procedure runs for about 770,000 steps. (See the configuration sketch below the table.) |
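
The Experiment Setup and Hardware Specification rows quote concrete hyperparameters and the mixed-precision training setup. The sketch below is a minimal, hypothetical PyTorch rendering of those reported values, not code from the authors' repository at https://github.com/microsoft/unilm: the names `PretrainConfig`, `linear_warmup_then_decay`, `build_optimizer`, and `train_step`, as well as the HuggingFace-style `model(**batch).loss` interface, are assumptions introduced for illustration.

```python
from dataclasses import dataclass

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


@dataclass
class PretrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (values from the paper)."""
    num_layers: int = 24          # 24-layer Transformer
    hidden_size: int = 1024       # 1,024 hidden size
    num_heads: int = 16           # 16 attention heads (~340M parameters total)
    max_seq_length: int = 512
    mask_prob: float = 0.15       # token masking probability
    learning_rate: float = 3e-5
    warmup_steps: int = 40_000    # linear warmup, then linear decay
    total_steps: int = 770_000
    dropout: float = 0.1
    weight_decay: float = 0.01
    batch_size: int = 330


def linear_warmup_then_decay(step: int, cfg: PretrainConfig) -> float:
    """LR multiplier: ramp up linearly for warmup_steps, then decay linearly to 0."""
    if step < cfg.warmup_steps:
        return step / max(1, cfg.warmup_steps)
    remaining = cfg.total_steps - step
    return max(0.0, remaining / max(1, cfg.total_steps - cfg.warmup_steps))


def build_optimizer(model: torch.nn.Module, cfg: PretrainConfig):
    """AdamW with the quoted learning rate and weight decay, plus the LR schedule."""
    optimizer = AdamW(model.parameters(), lr=cfg.learning_rate,
                      weight_decay=cfg.weight_decay)
    scheduler = LambdaLR(optimizer, lambda step: linear_warmup_then_decay(step, cfg))
    return optimizer, scheduler


def train_step(model, batch, optimizer, scheduler, scaler):
    """One mixed-precision step (the paper reports 8x Tesla V100 32GB with mixed precision)."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss   # assumes a HuggingFace-style masked-LM interface
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.detach()
```

In this sketch, `scaler` would be a `torch.cuda.amp.GradScaler()` instance created once before training; the exact optimizer settings and loss interface used by the authors may differ from what is shown here.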