Understand and Modularize Generator Optimization in ELECTRA-style Pretraining

Authors: Chengyu Dong, Liyuan Liu, Hao Cheng, Jingbo Shang, Jianfeng Gao, Xiaodong Liu

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments with the standard BERT-base and BERT-large (Devlin et al., 2019) pre-training setting on the GLUE (Wang et al., 2018) benchmark, and our simple technique consistently outperforms the original ELECTRA design and alternative pre-training specifications that are more recently proposed. (Section 1, Introduction; Section 6, Experiments)
Researcher Affiliation | Collaboration | Chengyu Dong (1,2), Liyuan Liu (3), Hao Cheng (3), Jingbo Shang (1), Jianfeng Gao (3), Xiaodong Liu (3). (1) University of California, San Diego; (2) work was done during an internship at Microsoft; (3) Microsoft Research.
Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or algorithm blocks (e.g., Algorithm 1, Pseudocode).
Open Source Code | Yes | Our source code is publicly available (footnote 2: https://github.com/namisan/Decoupled Optim). (Section 1, Introduction)
Open Datasets | Yes | Specifically, we employ Wikipedia and Book Corpus (Zhu et al., 2015) (16 GB of texts, 256M samples) for pretraining with sequence length as 512. (Section 6.1, Pretraining Setup)
Dataset Splits | Yes | We conduct the evaluation on downstream tasks following the setup in previous works (Meng et al., 2021; Bajaj et al., 2022). Specifically, we evaluate on GLUE (Wang et al., 2018) language understanding benchmark with a single-task, single-model finetuning setting following previous works. (Section 6.1, Downstream evaluation setup; see the GLUE split sketch after this table)
Hardware Specification | Yes | We conduct pretraining on NVIDIA Tesla V100 with 32GB memory and fine-tuning on NVIDIA Tesla P100 with 16GB memory. (Appendix C, A Roadmap to Hyperparameter Tuning)
Software Dependencies | No | The paper mentions FAIRSEQ but does not specify its version or other software dependencies with version numbers.
Experiment Setup | Yes | Table 2. Hyperparameter settings used in pretraining: Max Steps 125K, Optimizer Adam, Peak Learning Rate (Generator) 2 × 10^-4, Peak Learning Rate (Discriminator) 1.5 × 10^-3, Batch Size 2048, Warm-Up Steps 10K, Sequence Length 512. (A config sketch follows this table.)
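
The Table 2 values quoted above can be gathered into a single configuration object for readability. This is a minimal sketch, not the authors' FAIRSEQ configuration: the class and field names are assumptions introduced here, and only the numeric values come from the quoted table.

```python
# Hedged sketch: Table 2 pretraining hyperparameters as a plain config object.
# Class and field names are illustrative, not taken from the authors' code;
# only the values mirror the quoted table.
from dataclasses import dataclass

@dataclass
class PretrainingConfig:
    max_steps: int = 125_000               # Max Steps: 125K
    optimizer: str = "adam"                # Optimizer: Adam
    peak_lr_generator: float = 2e-4        # Peak Learning Rate (Generator): 2 x 10^-4
    peak_lr_discriminator: float = 1.5e-3  # Peak Learning Rate (Discriminator): 1.5 x 10^-3
    batch_size: int = 2048                 # Batch Size: 2048
    warmup_steps: int = 10_000             # Warm-Up Steps: 10K
    sequence_length: int = 512             # Sequence Length: 512

print(PretrainingConfig())
```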
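
For the Dataset Splits row, the single-task GLUE setup can be illustrated with a short sketch. It is not the authors' evaluation code; it assumes the Hugging Face datasets library and only shows how the standard per-task train/validation splits are obtained for single-task, single-model finetuning.

```python
# Hedged sketch of obtaining GLUE per-task splits for single-task finetuning.
# Assumes the Hugging Face `datasets` library; this is not the authors' code.
from datasets import load_dataset

def load_glue_task(task: str):
    """Return the standard train and validation splits for one GLUE task."""
    data = load_dataset("glue", task)
    # MNLI provides matched/mismatched validation sets; other tasks a single one.
    val_key = "validation_matched" if task == "mnli" else "validation"
    return data["train"], data[val_key]

train_set, dev_set = load_glue_task("rte")
print(len(train_set), len(dev_set))
```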