ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. (A minimal code sketch of this replaced-token-detection objective follows the table.) |
| Researcher Affiliation | Collaboration | Kevin Clark Stanford University kevclark@cs.stanford.edu Minh-Thang Luong Google Brain thangluong@google.com Quoc V. Le Google Brain qvl@google.com Christopher D. Manning Stanford University & CIFAR Fellow manning@cs.stanford.edu |
| Pseudocode | No | The paper describes the model architecture and training process using mathematical equations and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Code and pre-trained weights will be released at https://github.com/google-research/electra |
| Open Datasets | Yes | For most experiments we pre-train on the same data as BERT, which consists of 3.3 Billion tokens from Wikipedia and BooksCorpus (Zhu et al., 2015). However, for our Large model we pre-trained on the data used for XLNet (Yang et al., 2019), which extends the BERT dataset to 33B tokens by including data from ClueWeb (Callan et al., 2009), Common Crawl, and Gigaword (Parker et al., 2011). |
| Dataset Splits | Yes | We evaluate on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). Unless stated otherwise, results are on the dev set. |
| Hardware Specification | Yes | Train Time + Hardware: 4d on 1 V100 GPU, 4d on 16 TPUv3s |
| Software Dependencies | No | The paper mentions 'TensorFlow’s FLOP-counting capabilities' but does not specify a version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | The full set of pre-training hyperparameters is listed in Table 6 (e.g., number of layers 12, hidden size 256, learning rate 5e-4, batch size 128, 1.45M/1M train steps for the Small models), and the full set of fine-tuning hyperparameters is listed in Table 7 (e.g., learning rate 3e-4, batch size 32, 10 train epochs for GLUE). The quoted values are collected in the config sketch after the table. |
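
For a concrete picture of the pre-training task quoted in the Research Type row, below is a minimal sketch of ELECTRA's replaced-token-detection objective. It assumes PyTorch and uses toy embedding-plus-linear stand-ins for the generator and discriminator (the paper uses full Transformer encoders); it is a sketch of the idea, not the authors' released implementation.

```python
# Minimal sketch of ELECTRA's replaced-token-detection objective.
# Assumptions: PyTorch, toy sizes, and embedding+linear stand-ins for the
# generator/discriminator instead of the paper's Transformer encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, batch, seq_len = 1000, 64, 4, 16
mask_id = 0  # hypothetical [MASK] token id for this toy vocabulary

generator = nn.Sequential(nn.Embedding(vocab_size, hidden),
                          nn.Linear(hidden, vocab_size))   # small MLM generator
discriminator = nn.Sequential(nn.Embedding(vocab_size, hidden),
                              nn.Linear(hidden, 1))        # ELECTRA discriminator

tokens = torch.randint(1, vocab_size, (batch, seq_len))

# 1) Mask out a small subset of positions (15% in the paper).
mask = torch.rand(batch, seq_len) < 0.15
masked_input = tokens.masked_fill(mask, mask_id)

# 2) The generator is trained with MLM on the masked positions only, and its
#    samples replace the masked tokens to produce a corrupted input.
gen_logits = generator(masked_input)                          # [B, T, V]
mlm_loss = F.cross_entropy(gen_logits[mask], tokens[mask])    # masked subset only
sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# 3) The discriminator predicts, for EVERY input token, whether it was replaced.
#    Sampled tokens that happen to equal the original count as "original".
is_replaced = (corrupted != tokens).float()
disc_logits = discriminator(corrupted).squeeze(-1)            # [B, T]
disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# Combined objective; the paper weights the discriminator loss with lambda = 50.
loss = mlm_loss + 50.0 * disc_loss
loss.backward()
print(float(loss))
```

This illustrates the efficiency argument quoted above: the MLM loss is computed only over the roughly 15% of positions that were masked, while the discriminator loss is computed over all input tokens.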
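
The hyperparameter values quoted in the Experiment Setup row are collected below as a plain Python dictionary for quick reference. Only the values cited there are included; the paper's Tables 6 and 7 contain the full sets, and the split into pre-training vs. fine-tuning follows the quoted text.

```python
# Subset of ELECTRA-Small hyperparameters quoted above (from Tables 6 and 7);
# this is a convenience listing, not a complete or official configuration.
ELECTRA_SMALL_PRETRAIN = {
    "num_layers": 12,
    "hidden_size": 256,
    "learning_rate": 5e-4,
    "batch_size": 128,
    "train_steps": (1_450_000, 1_000_000),  # "1.45M/1M" for the Small models
}

GLUE_FINETUNE = {
    "learning_rate": 3e-4,
    "batch_size": 32,
    "train_epochs": 10,
}

if __name__ == "__main__":
    print("pre-training:", ELECTRA_SMALL_PRETRAIN)
    print("GLUE fine-tuning:", GLUE_FINETUNE)
```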