PMI-Masking: Principled masking of correlated spans

Authors: Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Yoav Shoham

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that such uniform masking allows an MLM to minimize its training objective by latching onto shallow local signals, leading to pretraining inefficiency and suboptimal downstream performance. To address this flaw, we propose PMI-Masking... Specifically, we show experimentally that PMI-Masking reaches the performance of prior masking approaches in half the training time, and consistently improves performance at the end of training. Figure 1: SQuAD2.0 development set F1 scores of BERT-BASE models trained with different masking schemes, evaluated every 200K steps during pretraining.
Researcher Affiliation | Industry | AI21 Labs, Tel Aviv, Israel; {yoavl,barakl,opherl,omria,...}@ai21.com
Pseudocode | No | The paper describes algorithmic steps in narrative text (e.g., in Section 3.2, "PMI: From Bigrams to n-grams", and Section 3.2.1, "PMI-Masking") but does not provide structured pseudocode or algorithm blocks (see the illustrative sketch after this table).
Open Source Code | No | The paper mentions external GitHub repositories (e.g., for HuggingFace tokenizers and the original BERT) but does not provide a direct link or an explicit statement about releasing the source code for the PMI-Masking method described in the paper.
Open Datasets | Yes | We trained uncased models with a 30K-sized vocabulary that we constructed over WIKIPEDIA+BOOKCORPUS via the WordPiece tokenizer used in BERT. We show that PMI-Masking achieved even larger performance gains relative to the baselines when training over more data, by adding the 38GB OPENWEBTEXT (Gokaslan & Cohen, 2019) dataset, an open-source recreation of the WebText corpus described in Radford et al. (2019). (See the vocabulary-construction sketch after this table.)
Dataset Splits | Yes | We report the best median development set score over five random initializations per hyper-parameter. When applicable, the model with this score was evaluated on the test set. The development set score of each configuration was attained by fine-tuning the model over 4 epochs (SQuAD2.0 and RACE) or 3 epochs (all GLUE tasks except RTE and STS, which were fine-tuned for 10 epochs) and performing early stopping based on each task's evaluation metric on the development set. (See the model-selection sketch after this table.)
Hardware Specification | No | The paper does not explicitly specify the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions the 'WordPiece Tokenizer' and refers to a 'Transformer-based architecture', but does not provide specific version numbers for the software dependencies or libraries used in the implementation.
Experiment Setup | Yes | We trained with a sequence length of 512 tokens, batch size of 256, and a varying number of steps detailed in Section 5. For pretraining, after a warmup of 10,000 steps we used a linear learning rate decay... Table 5: Hyper-parameters of the architecture and pretraining, complementing the description in Section 4: Number of Layers 12; Hidden Size 768; Sequence Length 512; FFN Inner Hidden Size 3072; Attention Heads 12; Attention Head Size 64; Dropout 0.1; Attention Dropout 0.1; Warmup Steps 10,000; Peak Learning Rate 1e-4; Batch Size 256; Weight Decay 0.01; Initializer Range 0.02; Learning Rate Decay Linear; Adam ϵ 1e-6; Adam β1 0.9; Adam β2 0.999. (See the pretraining-setup sketch after this table.)
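As noted in the Pseudocode row, the paper describes its method only in narrative text. The following is a minimal Python sketch of the two steps named there: an n-gram PMI score of the min-over-segmentations form described in Section 3.2 (as we read it), and span selection that masks high-PMI collocations as whole units (Section 3.2.1). The `prob` helper, the greedy longest-match grouping, the 15% masking budget, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from itertools import combinations

def segmentations(ngram):
    """All ways to split an n-gram into two or more contiguous sub-spans."""
    n = len(ngram)
    for k in range(1, n):                              # number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [tuple(ngram[a:b]) for a, b in zip(bounds, bounds[1:])]

def pmi_n(ngram, prob):
    """PMI_n(w1..wn): minimum over segmentations of
    log p(w1..wn) - sum of log p(segment).
    `prob` is an assumed helper mapping a token tuple to its corpus probability."""
    log_joint = math.log(prob(tuple(ngram)))
    return min(
        log_joint - sum(math.log(prob(seg)) for seg in segmentation)
        for segmentation in segmentations(ngram)
    )

def pmi_mask_indices(tokens, collocation_set, mask_rate=0.15, max_n=5, rng=random):
    """Sketch of span selection: greedily group tokens into known high-PMI
    collocations (treated as single masking units), then sample whole units
    uniformly until roughly mask_rate of the tokens are covered."""
    units, i = [], 0
    while i < len(tokens):
        for n in range(max_n, 1, -1):                  # prefer the longest match
            if tuple(tokens[i:i + n]) in collocation_set:
                units.append(list(range(i, i + n)))
                i += n
                break
        else:                                          # no collocation starts here
            units.append([i])
            i += 1
    budget = int(mask_rate * len(tokens))
    rng.shuffle(units)
    masked = []
    for unit in units:
        if len(masked) >= budget:
            break
        masked.extend(unit)
    return sorted(masked)
```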
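For the vocabulary construction quoted in the Open Datasets row (an uncased 30K WordPiece vocabulary over WIKIPedia+BOOKCORPUS), a hedged sketch using the HuggingFace tokenizers library that the paper points to is given below; the file paths, minimum frequency, and special-token list are placeholders rather than the paper's exact settings.

```python
from tokenizers import BertWordPieceTokenizer

# Placeholder corpus files; the paper's exact Wikipedia+BookCorpus
# preprocessing is not specified beyond the quoted description.
corpus_files = ["wikipedia.txt", "bookcorpus.txt"]

tokenizer = BertWordPieceTokenizer(lowercase=True)     # uncased, as quoted
tokenizer.train(
    files=corpus_files,
    vocab_size=30000,                                  # 30K vocabulary, as quoted
    min_frequency=2,                                   # assumed, not from the paper
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")                              # writes vocab.txt
```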
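The selection protocol quoted in the Dataset Splits row (best median development-set score over five random initializations per hyper-parameter configuration, with each run early-stopped on the task metric) can be made concrete with a small sketch; the configuration names and scores below are hypothetical.

```python
from statistics import median

def select_configuration(dev_scores):
    """dev_scores: {config_name: [score_seed1, ..., score_seed5]}, where each
    per-seed score is already the best (early-stopped) dev metric for that run.
    Returns the configuration with the highest median score across seeds."""
    medians = {cfg: median(scores) for cfg, scores in dev_scores.items()}
    best_cfg = max(medians, key=medians.get)
    return best_cfg, medians[best_cfg]

# Hypothetical example: two fine-tuning configurations, five seeds each.
dev_scores = {
    "lr3e-5_bs32": [81.2, 80.9, 81.5, 80.7, 81.1],
    "lr5e-5_bs16": [80.4, 81.8, 80.2, 80.9, 80.6],
}
print(select_configuration(dev_scores))                # ('lr3e-5_bs32', 81.1)
```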
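The hyper-parameters quoted in the Experiment Setup row correspond to a standard BERT-base masked-language-model setup; the sketch below expresses them with PyTorch and the HuggingFace transformers API, an assumed framework choice since the paper does not name its training stack. The total step count is a placeholder ("a varying number of steps detailed in Section 5").

```python
import torch
from transformers import BertConfig, BertForMaskedLM, get_linear_schedule_with_warmup

# Architecture hyper-parameters quoted from the row above (Table 5 of the paper).
config = BertConfig(
    vocab_size=30000,
    num_hidden_layers=12,
    hidden_size=768,
    intermediate_size=3072,
    num_attention_heads=12,            # 12 heads of size 64
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    initializer_range=0.02,
)
model = BertForMaskedLM(config)

# Optimizer and schedule hyper-parameters quoted from the row above.
# Applying weight decay to all parameters is a simplification; BERT-style
# setups usually exclude biases and LayerNorm weights.
total_steps = 1_000_000                # placeholder, set per experiment
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6, weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=10_000, num_training_steps=total_steps
)
```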