UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
Authors: Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, Hsiao-Wuen Hon
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of language understanding and generation tasks across several widely used benchmarks. |
| Researcher Affiliation | Collaboration | Harbin Institute of Technology; Microsoft Research. |
| Pseudocode | Yes | Algorithm 1 Blockwise Masking (an illustrative sketch follows the table) |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/microsoft/unilm. |
| Open Datasets | Yes | We use 160GB text corpora from English Wikipedia, BookCorpus (Zhu et al., 2015), OpenWebText, CC-News (Liu et al., 2019), and Stories (Trinh & Le, 2018). |
| Dataset Splits | Yes | We compare previous BASE-size models with PMLM. Notice that the publicly available BERTBASE checkpoint (Devlin et al., 2018) is pre-trained on 13GB corpora with 256 batch size, while XLNetBASE and RoBERTaBASE are more directly comparable. The results show that UNILMv2BASE achieves better performance than the other models on both SQuAD datasets. Table 3 presents the results on GLUE. |
| Hardware Specification | Yes | We ran the pre-training procedure for 0.5 million steps, which took about 20 days using 64 Nvidia V100-32GB GPU cards. |
| Software Dependencies | No | The paper mentions using "Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98, and ϵ = 1e-6 for optimization" but does not specify version numbers for any key software libraries or frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | Specifically, we used a 12-layer Transformer with 12 attention heads. The hidden size was 768, and the inner hidden size of the feed-forward network was 3072. The weight matrix of the softmax classifier was tied with the token embedding matrix... The token masking probability was 15%. Among masked positions, 80% of the time we replaced the token with masks, 10% of the time with a random token, and kept the original token for the rest... The batch size was set to 7680. We used Adam (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98, and ϵ = 1e-6 for optimization. The peak learning rate was set to 6e-4, with linear warmup over the first 24,000 steps and linear decay. The weight decay was 0.01. The dropout rate was set to 0.1. We ran the pre-training procedure for 0.5 million steps. (These settings are restated in the configuration sketch after this table.) |
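
The "Pseudocode" row above refers to the paper's Algorithm 1 (Blockwise Masking). The following is a minimal Python sketch of blockwise span masking at a high level: contiguous spans of positions are masked until roughly 15% of the tokens are covered. The block-length range, the probability of picking a block versus a single token, and the function name `blockwise_mask` are illustrative assumptions, not the paper's exact settings.

```python
import random

def blockwise_mask(tokens, mask_ratio=0.15, max_block_len=6, block_prob=0.4):
    """Illustrative blockwise masking: mask contiguous spans until ~mask_ratio
    of positions are covered. block_prob and max_block_len are assumptions,
    not the exact values from the paper's Algorithm 1."""
    n = len(tokens)
    budget = max(1, int(round(n * mask_ratio)))  # ~15% of positions
    masked = set()
    while len(masked) < budget:
        # Choose a span length: either a single token or a short block.
        if random.random() < block_prob:
            length = random.randint(2, max_block_len)
        else:
            length = 1
        length = min(length, budget - len(masked))
        start = random.randrange(0, n - length + 1)
        span = range(start, start + length)
        if any(i in masked for i in span):
            continue  # skip spans that overlap previously masked positions
        masked.update(span)
    return sorted(masked)

# Usage: the returned positions would then be corrupted according to the
# 80/10/10 rule quoted in the "Experiment Setup" row above.
positions = blockwise_mask("the quick brown fox jumps over the lazy dog".split())
```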
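
The "Experiment Setup" row quotes the model size, the 15% masking rate with the 80/10/10 replacement rule, and the Adam hyperparameters with linear warmup and decay. The sketch below simply restates those quoted numbers as code; the `mask_id`, `vocab_size`, and helper names are hypothetical placeholders, not part of the paper's released code.

```python
import random

# Hyperparameters quoted in the "Experiment Setup" row above.
PRETRAIN_CONFIG = {
    "layers": 12,
    "attention_heads": 12,
    "hidden_size": 768,
    "ffn_inner_size": 3072,
    "batch_size": 7680,
    "optimizer": {"name": "Adam", "beta1": 0.9, "beta2": 0.98, "eps": 1e-6},
    "peak_lr": 6e-4,
    "warmup_steps": 24_000,   # linear warmup, then linear decay
    "weight_decay": 0.01,
    "dropout": 0.1,
    "total_steps": 500_000,
}

def corrupt_masked_position(token_id, mask_id, vocab_size):
    """80/10/10 rule from the quoted setup: 80% replace with the mask token,
    10% replace with a random token, 10% keep the original token."""
    r = random.random()
    if r < 0.8:
        return mask_id
    if r < 0.9:
        return random.randrange(vocab_size)
    return token_id

def linear_warmup_decay_lr(step, peak_lr=6e-4, warmup=24_000, total=500_000):
    """Linear warmup over the first 24,000 steps, then linear decay to zero."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))
```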