Understand and Modularize Generator Optimization in ELECTRA-style Pretraining
Authors: Chengyu Dong, Liyuan Liu, Hao Cheng, Jingbo Shang, Jianfeng Gao, Xiaodong Liu
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with the standard BERT-base and BERT-large (Devlin et al., 2019) pre-training setting on the GLUE (Wang et al., 2018) benchmark, and our simple technique consistently outperforms the original ELECTRA design and alternative pre-training specifications that are more recently proposed. (Section 1, Introduction; Section 6, Experiments) |
| Researcher Affiliation | Collaboration | Chengyu Dong¹,², Liyuan Liu³, Hao Cheng³, Jingbo Shang¹, Jianfeng Gao³, Xiaodong Liu³. ¹University of California, San Diego; ²Work was done during an internship at Microsoft; ³Microsoft Research. |
| Pseudocode | No | The paper describes methods and processes but does not include structured pseudocode or algorithm blocks (e.g., Algorithm 1, Pseudocode). |
| Open Source Code | Yes | Our source code is publicly available (footnote 2: https://github.com/namisan/Decoupled_Optim). (Section 1, Introduction) |
| Open Datasets | Yes | Specifically, we employ Wikipedia and Book Corpus (Zhu et al., 2015) (16 GB of texts, 256M samples) for pretraining with sequence length as 512. (Section 6.1, Pretraining Setup) |
| Dataset Splits | Yes | We conduct the evaluation on downstream tasks following the setup in previous works (Meng et al., 2021; Bajaj et al., 2022). Specifically, we evaluate on GLUE (Wang et al., 2018) language understanding benchmark with a single-task, single-model finetuning setting following previous works. (Section 6.1, Downstream evaluation setup) |
| Hardware Specification | Yes | We conduct pretraining on NVIDIA Tesla V100 with 32GB memory and fine-tuning on NVIDIA Tesla P100 with 16GB memory. (Appendix C, A Roadmap to Hyperparameter Tuning) |
| Software Dependencies | No | The paper mentions 'FAIRSEQ' but does not specify its version or other software dependencies with version numbers. |
| Experiment Setup | Yes | Table 2, Hyperparameter settings used in pretraining: Max Steps 125K, Optimizer Adam, Peak Learning Rate (Generator) 2×10⁻⁴, Peak Learning Rate (Discriminator) 1.5×10⁻³, Batch Size 2048, Warm-Up Steps 10K, Sequence Length 512 (see the configuration sketch below the table). |
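
The reported hyperparameters describe a decoupled optimization setup with separate peak learning rates for the generator and the discriminator. Below is a minimal PyTorch sketch, not the authors' released implementation, that wires the reported values (Adam, peak LRs 2×10⁻⁴ / 1.5×10⁻³, 10K warm-up steps, 125K max steps) into two independent optimizers; the placeholder modules and the linear decay after warm-up are assumptions.

```python
# Minimal sketch of decoupled generator/discriminator optimization.
# Only the optimizer choice, peak LRs, warm-up steps, and max steps come from the
# paper's Table 2; the modules and the post-warm-up decay shape are assumptions.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

MAX_STEPS = 125_000        # Max Steps
WARMUP_STEPS = 10_000      # Warm-Up Steps
GEN_PEAK_LR = 2e-4         # Peak Learning Rate (Generator)
DISC_PEAK_LR = 1.5e-3      # Peak Learning Rate (Discriminator)

def linear_warmup_decay(step: int) -> float:
    """Linear warm-up to the peak LR, then linear decay to zero (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (MAX_STEPS - step) / max(1, MAX_STEPS - WARMUP_STEPS))

# Placeholder modules standing in for the ELECTRA generator and discriminator networks.
generator = torch.nn.Linear(768, 768)
discriminator = torch.nn.Linear(768, 768)

# Decoupled optimization: each network gets its own Adam optimizer and peak LR.
gen_opt = Adam(generator.parameters(), lr=GEN_PEAK_LR)
disc_opt = Adam(discriminator.parameters(), lr=DISC_PEAK_LR)
gen_sched = LambdaLR(gen_opt, lr_lambda=linear_warmup_decay)
disc_sched = LambdaLR(disc_opt, lr_lambda=linear_warmup_decay)
```

Stepping both schedulers once per pretraining update reproduces the warm-up behaviour; the training loop itself, the batching (2048 sequences of length 512), and the ELECTRA replaced-token-detection losses are omitted here.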