Representation Deficiency in Masked Language Modeling

Authors: Yu Meng, Jitin Krishnan, Sinong Wang, Qifan Wang, Yuning Mao, Han Fang, Marjan Ghazvininejad, Jiawei Han, Luke Zettlemoyer

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we show that MAE-LM improves the utilization of model dimensions for real token representations, and MAE-LM consistently outperforms MLM-pretrained models on the GLUE and SQuAD benchmarks. (Section 4 Experiments, 4.2 Overall Results) Table 1 shows the results under the two base model pretraining settings on the GLUE and SQuAD 2.0 benchmarks.
Researcher Affiliation | Collaboration | University of Illinois Urbana-Champaign ({yumeng5, hanj}@illinois.edu) and Meta AI ({jitinkrishnan, sinongwang, wqfcr, yuningm, hanfang, ghazvini, lsz}@meta.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code can be found at https://github.com/yumeng5/MAE-LM.
Open Datasets | Yes | We evaluate the pretrained models on the GLUE (Wang et al., 2018) and SQuAD 2.0 (Rajpurkar et al., 2018) benchmarks. The base setting uses 16GB training corpus following BERT (Devlin et al., 2019) while the base++ setting uses 160GB training corpus following RoBERTa (Liu et al., 2019). (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | All reported fine-tuning results are the medians of five random seeds on GLUE and SQuAD, following previous studies (Liu et al., 2019). The hyperparameter search space for fine-tuning can be found in Appendix D. (A seed-median sketch follows the table.)
Hardware Specification | Yes | The experiments in this paper are conducted on 64 A100 GPUs.
Software Dependencies | No | The paper mentions software components but does not provide version numbers, e.g., "We train both absolute and relative position embeddings (Raffel et al., 2019) in the encoder. The vocabulary is constructed with BPE (Sennrich et al., 2015)", without specifying a version for the BPE library or for the general software environment such as Python or PyTorch. (A BPE tokenizer sketch follows the table.)
Experiment Setup | Yes | Pretraining Settings. We evaluate MAE-LM mainly under the base model scale for two pretraining settings: base and base++. Both settings pretrain 12-layer Transformers with 768 model dimensions. The base setting uses 16GB training corpus following BERT (Devlin et al., 2019) while the base++ setting uses 160GB training corpus following RoBERTa (Liu et al., 2019). The details can be found in Appendix D. (Table 3 lists the hyperparameters used in pretraining.)
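The Open Datasets row above points to GLUE and SQuAD 2.0, both public benchmarks. The sketch below is only an illustration of pulling them with the Hugging Face `datasets` library; the dataset identifiers and the choice of MNLI as the representative GLUE task are assumptions, not something the paper or the MAE-LM repository specifies.

```python
# Illustrative only: load the public evaluation benchmarks cited in the paper.
from datasets import load_dataset

# GLUE is a suite of tasks; MNLI is shown here as one representative task (assumption).
mnli = load_dataset("glue", "mnli")
print({split: mnli[split].num_rows for split in mnli})

# SQuAD 2.0 extends SQuAD 1.1 with unanswerable questions.
squad_v2 = load_dataset("squad_v2")
print({split: squad_v2[split].num_rows for split in squad_v2})
```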
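The Dataset Splits row quotes the reporting protocol: fine-tuning results are the medians of five random seeds. The minimal sketch below shows that protocol in isolation; `finetune_and_evaluate` is a hypothetical placeholder, not a function from the MAE-LM codebase.

```python
# Minimal sketch of the reporting protocol: fine-tune with five seeds, report the median.
from statistics import median

def finetune_and_evaluate(task: str, seed: int) -> float:
    """Hypothetical placeholder: fine-tune the pretrained checkpoint on `task`
    with the given random seed and return the dev-set metric."""
    raise NotImplementedError

def report_score(task: str, seeds=(1, 2, 3, 4, 5)) -> float:
    # The paper reports the median over five random seeds on GLUE and SQuAD.
    scores = [finetune_and_evaluate(task, seed) for seed in seeds]
    return median(scores)
```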
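The Software Dependencies row notes that the paper cites BPE (Sennrich et al., 2015) without naming a library or version. Below is a hedged sketch of BPE vocabulary construction using the Hugging Face `tokenizers` package; the library choice, corpus path, and vocabulary size are assumptions rather than details from the paper.

```python
# Hedged sketch: build a byte-level BPE vocabulary with the `tokenizers` library.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["pretraining_corpus.txt"],  # hypothetical path to the pretraining text
    vocab_size=50265,                  # RoBERTa-style vocabulary size (assumption)
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
os.makedirs("bpe_tokenizer", exist_ok=True)
tokenizer.save_model("bpe_tokenizer")  # writes vocab.json and merges.txt
```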
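The Experiment Setup row quotes the base-scale architecture: 12-layer Transformers with 768 model dimensions. The sketch below expresses that shape as a standard Hugging Face `transformers` encoder config; the remaining hyperparameters are common BERT/RoBERTa-base defaults rather than values confirmed by the paper, and the config does not reproduce MAE-LM's handling of [MASK] tokens or its relative position embeddings.

```python
# Sketch of the base-setting encoder shape (12 layers, 768 dimensions) as a plain config.
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig(
    num_hidden_layers=12,    # 12-layer Transformer encoder (base setting)
    hidden_size=768,         # 768 model dimensions
    num_attention_heads=12,  # standard for a 768-dim base model (assumption)
    intermediate_size=3072,  # standard FFN width (assumption)
)
model = RobertaModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```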