ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. (A sketch of this objective follows the table.)
Researcher Affiliation | Collaboration | Kevin Clark (Stanford University, kevclark@cs.stanford.edu); Minh-Thang Luong (Google Brain, thangluong@google.com); Quoc V. Le (Google Brain, qvl@google.com); Christopher D. Manning (Stanford University & CIFAR Fellow, manning@cs.stanford.edu)
Pseudocode | No | The paper describes the model architecture and training process using mathematical equations and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Code and pre-trained weights will be released at https://github.com/google-research/electra
Open Datasets | Yes | For most experiments we pre-train on the same data as BERT, which consists of 3.3 billion tokens from Wikipedia and BooksCorpus (Zhu et al., 2015). However, for our Large model we pre-trained on the data used for XLNet (Yang et al., 2019), which extends the BERT dataset to 33B tokens by including data from ClueWeb (Callan et al., 2009), Common Crawl, and Gigaword (Parker et al., 2011).
Dataset Splits | Yes | We evaluate on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). Unless stated otherwise, results are on the dev set.
Hardware Specification | Yes | Train Time + Hardware: 4d on 1 V100 GPU, 4d on 16 TPUv3s
Software Dependencies | No | The paper mentions 'TensorFlow's FLOP-counting capabilities' but does not specify a version number for TensorFlow or any other software dependencies. (A sketch of such FLOP counting follows the table.)
Experiment Setup | Yes | The full set of hyperparameters is listed in Table 6 (e.g., number of layers 12, hidden size 256, learning rate 5e-4, batch size 128, train steps 1.45M/1M for Small models), and the fine-tuning hyperparameters are listed in Table 7 (e.g., learning rate 3e-4, batch size 32, train epochs 10 for GLUE). (These quoted values are collected in a config sketch after the table.)
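
The "Research Type" row quotes the paper's claim that the pre-training task is defined over all input tokens rather than just the masked subset. The following is a minimal sketch of that replaced-token-detection objective, not the authors' released code: `mask_tokens`, `replace_at`, `generator`, and `discriminator` are hypothetical placeholders, and greedy argmax stands in for the sampling the paper actually uses. The lambda = 50 weighting is the value reported in the paper.

```python
import tensorflow as tf

def electra_losses(generator, discriminator, input_ids, mask_positions,
                   disc_weight=50.0):
    """Joint objective: MLM on the masked subset + detection on ALL positions."""
    # 1) Mask a small subset of tokens and train the generator with ordinary MLM.
    masked_ids = mask_tokens(input_ids, mask_positions)             # hypothetical helper
    gen_logits = generator(masked_ids)                              # [batch, seq, vocab]
    masked_labels = tf.gather(input_ids, mask_positions, batch_dims=1)
    masked_logits = tf.gather(gen_logits, mask_positions, batch_dims=1)
    mlm_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=masked_labels, logits=masked_logits))

    # 2) Replace the masked positions with the generator's predictions
    #    (the paper samples from the generator; argmax keeps the sketch short).
    predictions = tf.argmax(gen_logits, axis=-1, output_type=input_ids.dtype)
    corrupted = replace_at(input_ids, mask_positions, predictions)  # hypothetical helper

    # 3) The discriminator scores EVERY input position: was this token replaced?
    #    Tokens the generator happens to reproduce correctly count as "original".
    is_replaced = tf.cast(tf.not_equal(corrupted, input_ids), tf.float32)
    disc_logits = discriminator(corrupted)                          # [batch, seq]
    disc_loss = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=is_replaced,
                                                logits=disc_logits))

    # L_MLM + lambda * L_disc; the paper reports lambda = 50.
    return mlm_loss + disc_weight * disc_loss
```

Because the discriminator receives a learning signal at every position rather than only the ~15% of masked ones, each pre-training example is used more efficiently, which is the efficiency argument quoted in the row above.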
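The "Software Dependencies" row notes that the paper relies on TensorFlow's FLOP-counting capabilities without pinning a version. Below is a minimal sketch, not the authors' script, of obtaining such a count with the TF1-compatibility profiler in TensorFlow 2.x; the `build_model` callable and the input shape are assumptions for illustration.

```python
import tensorflow as tf

def count_flops(build_model, input_shape=(1, 128)):
    """Sum the floating-point ops of a model constructed inside a frozen graph."""
    graph = tf.Graph()
    with graph.as_default():
        # Static shapes let the profiler resolve per-op FLOP counts.
        inputs = tf.compat.v1.placeholder(tf.int32, shape=input_shape,
                                          name="input_ids")
        build_model(inputs)  # hypothetical callable that builds the model graph
        opts = tf.compat.v1.profiler.ProfileOptionBuilder.float_operation()
        profile = tf.compat.v1.profiler.profile(graph, options=opts)
    return profile.total_float_ops
```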
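The "Experiment Setup" row cites Tables 6 and 7 of the paper for the full hyperparameter lists. The dictionaries below merely collect the values quoted in that row into a hypothetical config sketch; they are not a complete reproduction of the paper's tables.

```python
# Pre-training hyperparameters quoted for the Small models (paper Table 6).
ELECTRA_SMALL_PRETRAIN = {
    "num_layers": 12,
    "hidden_size": 256,
    "learning_rate": 5e-4,
    "batch_size": 128,
    "train_steps": "1.45M / 1M",  # as quoted for the Small models
}

# GLUE fine-tuning hyperparameters quoted from paper Table 7.
GLUE_FINETUNE = {
    "learning_rate": 3e-4,
    "batch_size": 32,
    "train_epochs": 10,
}
```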