Representation Deficiency in Masked Language Modeling
Authors: Yu Meng, Jitin Krishnan, Sinong Wang, Qifan Wang, Yuning Mao, Han Fang, Marjan Ghazvininejad, Jiawei Han, Luke Zettlemoyer
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we show that MAE-LM improves the utilization of model dimensions for real token representations, and MAE-LM consistently outperforms MLM-pretrained models on the GLUE and SQuAD benchmarks. (Section 4.2, Overall Results:) Table 1 shows the results under the two base model pretraining settings on the GLUE and SQuAD 2.0 benchmarks. |
| Researcher Affiliation | Collaboration | 1: University of Illinois Urbana-Champaign, 2: Meta AI. 1: {yumeng5, hanj}@illinois.edu, 2: {jitinkrishnan, sinongwang, wqfcr, yuningm, hanfang, ghazvini, lsz}@meta.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found at https://github.com/yumeng5/MAE-LM. |
| Open Datasets | Yes | We evaluate the pretrained models on the GLUE (Wang et al., 2018) and SQuAD 2.0 (Rajpurkar et al., 2018) benchmarks. The base setting uses 16GB training corpus following BERT (Devlin et al., 2019) while the base++ setting uses 160GB training corpus following RoBERTa (Liu et al., 2019). |
| Dataset Splits | Yes | All reported fine-tuning results are the medians of five random seeds on GLUE and SQuAD, following previous studies (Liu et al., 2019). The hyperparameter search space for fine-tuning can be found in Appendix D. |
| Hardware Specification | Yes | The experiments in this paper are conducted on 64 A100 GPUs. |
| Software Dependencies | No | The paper mentions software components but does not provide specific version numbers; e.g., "We train both absolute and relative position embeddings (Raffel et al., 2019) in the encoder. The vocabulary is constructed with BPE (Sennrich et al., 2015)" does not specify the version of the BPE library or of the general software environment (e.g., Python/PyTorch). |
| Experiment Setup | Yes | Pretraining Settings. We evaluate MAE-LM mainly under the base model scale for two pretraining settings: base and base++. Both settings pretrain 12-layer Transformers with 768 model dimensions. The base setting uses 16GB training corpus following BERT (Devlin et al., 2019) while the base++ setting uses 160GB training corpus following RoBERTa (Liu et al., 2019). The details can be found in Appendix D (Table 3: Hyperparameters used in pretraining). |
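
The snippet below is a minimal sketch summarizing the pretraining settings and the median-over-five-seeds reporting protocol quoted in the table. All field and function names are illustrative assumptions for readability; they are not taken from the released MAE-LM code at https://github.com/yumeng5/MAE-LM.

```python
import statistics

# Illustrative summary of the "base" pretraining setting described in the paper
# (field names are assumptions, not the authors' config schema).
BASE_CONFIG = {
    "num_layers": 12,          # 12-layer Transformer encoder
    "hidden_size": 768,        # 768 model dimensions
    "corpus_size_gb": 16,      # base: 16GB corpus, following BERT
    "position_embeddings": ["absolute", "relative"],
    "tokenizer": "BPE",
}

# base++: same architecture, 160GB corpus following RoBERTa.
BASE_PLUSPLUS_CONFIG = {**BASE_CONFIG, "corpus_size_gb": 160}

def report_fine_tuning_score(seed_scores):
    """Report the median over five random seeds, as in the fine-tuning protocol."""
    assert len(seed_scores) == 5
    return statistics.median(seed_scores)

# Hypothetical example scores from five fine-tuning seeds (not real results).
print(report_fine_tuning_score([88.1, 88.4, 88.3, 88.6, 88.2]))  # -> 88.3
```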