ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. (A minimal code sketch of this replaced-token-detection objective follows the table.) |
| Researcher Affiliation | Collaboration | Kevin Clark Stanford University kevclark@cs.stanford.edu Minh-Thang Luong Google Brain thangluong@google.com Quoc V. Le Google Brain qvl@google.com Christopher D. Manning Stanford University & CIFAR Fellow manning@cs.stanford.edu |
| Pseudocode | No | The paper describes the model architecture and training process using mathematical equations and textual descriptions, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Code and pre-trained weights will be released at https://github.com/google-research/electra |
| Open Datasets | Yes | For most experiments we pre-train on the same data as BERT, which consists of 3.3 Billion tokens from Wikipedia and BooksCorpus (Zhu et al., 2015). However, for our Large model we pre-trained on the data used for XLNet (Yang et al., 2019), which extends the BERT dataset to 33B tokens by including data from ClueWeb (Callan et al., 2009), Common Crawl, and Gigaword (Parker et al., 2011). |
| Dataset Splits | Yes | We evaluate on the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019) and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). Unless stated otherwise, results are on the dev set. |
| Hardware Specification | Yes | Train Time + Hardware: 4d on 1 V100 GPU, 4d on 16 TPUv3s |
| Software Dependencies | No | The paper mentions 'TensorFlow’s FLOP-counting capabilities' but does not specify a version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | The full set of pre-training hyperparameters is listed in Table 6 (e.g., number of layers 12, hidden size 256, learning rate 5e-4, batch size 128, 1.45M/1M train steps for the Small models), and the full set of fine-tuning hyperparameters is listed in Table 7 (e.g., learning rate 3e-4, batch size 32, 10 train epochs for GLUE). The quoted values are collected in the config sketch after the table. |
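
For a concrete picture of the pre-training task quoted in the Research Type row, below is a minimal sketch of ELECTRA's replaced-token-detection objective. It assumes PyTorch and uses toy embedding-plus-linear stand-ins for the generator and discriminator (the paper uses full Transformer encoders); it is a sketch of the idea, not the authors' released implementation.

```python
# Minimal sketch of ELECTRA's replaced-token-detection objective.
# Assumptions: PyTorch, toy sizes, and embedding+linear stand-ins for the
# generator/discriminator instead of the paper's Transformer encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, batch, seq_len = 1000, 64, 4, 16
mask_id = 0  # hypothetical [MASK] token id for this toy vocabulary

generator = nn.Sequential(nn.Embedding(vocab_size, hidden),
                          nn.Linear(hidden, vocab_size))   # small MLM generator
discriminator = nn.Sequential(nn.Embedding(vocab_size, hidden),
                              nn.Linear(hidden, 1))        # ELECTRA discriminator

tokens = torch.randint(1, vocab_size, (batch, seq_len))

# 1) Mask out a small subset of positions (15% in the paper).
mask = torch.rand(batch, seq_len) < 0.15
masked_input = tokens.masked_fill(mask, mask_id)

# 2) The generator is trained with MLM on the masked positions only, and its
#    samples replace the masked tokens to produce a corrupted input.
gen_logits = generator(masked_input)                          # [B, T, V]
mlm_loss = F.cross_entropy(gen_logits[mask], tokens[mask])    # masked subset only
sampled = torch.distributions.Categorical(logits=gen_logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# 3) The discriminator predicts, for EVERY input token, whether it was replaced.
#    Sampled tokens that happen to equal the original count as "original".
is_replaced = (corrupted != tokens).float()
disc_logits = discriminator(corrupted).squeeze(-1)            # [B, T]
disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

# Combined objective; the paper weights the discriminator loss with lambda = 50.
loss = mlm_loss + 50.0 * disc_loss
loss.backward()
print(float(loss))
```

This illustrates the efficiency argument quoted above: the MLM loss is computed only over the roughly 15% of positions that were masked, while the discriminator loss is computed over all input tokens.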
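
The hyperparameter values quoted in the Experiment Setup row are collected below as a plain Python dictionary for quick reference. Only the values cited there are included; the paper's Tables 6 and 7 contain the full sets, and the split into pre-training vs. fine-tuning follows the quoted text.

```python
# Subset of ELECTRA-Small hyperparameters quoted above (from Tables 6 and 7);
# this is a convenience listing, not a complete or official configuration.
ELECTRA_SMALL_PRETRAIN = {
    "num_layers": 12,
    "hidden_size": 256,
    "learning_rate": 5e-4,
    "batch_size": 128,
    "train_steps": (1_450_000, 1_000_000),  # "1.45M/1M" for the Small models
}

GLUE_FINETUNE = {
    "learning_rate": 3e-4,
    "batch_size": 32,
    "train_epochs": 10,
}

if __name__ == "__main__":
    print("pre-training:", ELECTRA_SMALL_PRETRAIN)
    print("GLUE fine-tuning:", GLUE_FINETUNE)
```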