UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
Authors: Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, Hsiao-Wuen Hon
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of language understanding and generation tasks across several widely used benchmarks. |
| Researcher Affiliation | Collaboration | Harbin Institute of Technology; Microsoft Research. |
| Pseudocode | Yes | Algorithm 1 Blockwise Masking (an illustrative sketch follows the table) |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/microsoft/unilm. |
| Open Datasets | Yes | We use 160GB text corpora from English Wikipedia, BookCorpus (Zhu et al., 2015), OpenWebText, CC-News (Liu et al., 2019), and Stories (Trinh & Le, 2018). |
| Dataset Splits | Yes | We compare previous BASE-size models with PMLM. Notice that the publicly available BERTBASE checkpoint (Devlin et al., 2018) is pre-trained on 13GB corpora with 256 batch size, while XLNetBASE and RoBERTaBASE are more directly comparable. The results show that UNILMv2BASE achieves better performance than the other models on both SQuAD datasets. Table 3 presents the results on GLUE. |
| Hardware Specification | Yes | We ran the pre-training procedure for 0.5 million steps, which took about 20 days using 64 Nvidia V100-32GB GPU cards. |
| Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98, and ϵ = 1e-6 for optimization" but does not specify version numbers for any key software libraries or frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | Specifically, we used a 12-layer Transformer with 12 attention heads. The hidden size was 768, and the inner hidden size of the feed-forward network was 3072. The weight matrix of the softmax classifier was tied with the token embedding matrix... The token masking probability was 15%. Among masked positions, 80% of the time we replaced the token with masks, 10% of the time with a random token, and kept the original token for the rest... The batch size was set to 7680. We used Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98, and ϵ = 1e-6 for optimization. The peak learning rate was set to 6e-4, with linear warmup over the first 24,000 steps and linear decay. The weight decay was 0.01. The dropout rate was set to 0.1. We ran the pre-training procedure for 0.5 million steps. (These settings are restated in the configuration sketch after this table.) |
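
The "Pseudocode" row above refers to the paper's Algorithm 1 (Blockwise Masking). The following is a minimal Python sketch of blockwise span masking at a high level: contiguous spans of positions are masked until roughly 15% of the tokens are covered. The block-length range, the probability of picking a block versus a single token, and the function name `blockwise_mask` are illustrative assumptions, not the paper's exact settings.

```python
import random

def blockwise_mask(tokens, mask_ratio=0.15, max_block_len=6, block_prob=0.4):
    """Illustrative blockwise masking: mask contiguous spans until ~mask_ratio
    of positions are covered. block_prob and max_block_len are assumptions,
    not the exact values from the paper's Algorithm 1."""
    n = len(tokens)
    budget = max(1, int(round(n * mask_ratio)))  # ~15% of positions
    masked = set()
    while len(masked) < budget:
        # Choose a span length: either a single token or a short block.
        if random.random() < block_prob:
            length = random.randint(2, max_block_len)
        else:
            length = 1
        length = min(length, budget - len(masked))
        start = random.randrange(0, n - length + 1)
        span = range(start, start + length)
        if any(i in masked for i in span):
            continue  # skip spans that overlap previously masked positions
        masked.update(span)
    return sorted(masked)

# Usage: the returned positions would then be corrupted according to the
# 80/10/10 rule quoted in the "Experiment Setup" row above.
positions = blockwise_mask("the quick brown fox jumps over the lazy dog".split())
```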
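
The "Experiment Setup" row quotes the model size, the 15% masking rate with the 80/10/10 replacement rule, and the Adam hyperparameters with linear warmup and decay. The sketch below simply restates those quoted numbers as code; the `mask_id`, `vocab_size`, and helper names are hypothetical placeholders, not part of the paper's released code.

```python
import random

# Hyperparameters quoted in the "Experiment Setup" row above.
PRETRAIN_CONFIG = {
    "layers": 12,
    "attention_heads": 12,
    "hidden_size": 768,
    "ffn_inner_size": 3072,
    "batch_size": 7680,
    "optimizer": {"name": "Adam", "beta1": 0.9, "beta2": 0.98, "eps": 1e-6},
    "peak_lr": 6e-4,
    "warmup_steps": 24_000,   # linear warmup, then linear decay
    "weight_decay": 0.01,
    "dropout": 0.1,
    "total_steps": 500_000,
}

def corrupt_masked_position(token_id, mask_id, vocab_size):
    """80/10/10 rule from the quoted setup: 80% replace with the mask token,
    10% replace with a random token, 10% keep the original token."""
    r = random.random()
    if r < 0.8:
        return mask_id
    if r < 0.9:
        return random.randrange(vocab_size)
    return token_id

def linear_warmup_decay_lr(step, peak_lr=6e-4, warmup=24_000, total=500_000):
    """Linear warmup over the first 24,000 steps, then linear decay to zero."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))
```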