AutoBERT-Zero: Evolving BERT Backbone from Scratch

Authors: Jiahui Gao, Hang Xu, Han Shi, Xiaozhe Ren, Philip L. H. Yu, Xiaodan Liang, Xin Jiang, Zhenguo Li

AAAI 2022, pp. 10663-10671

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted on the widely used Natural Language Understanding (NLU) and Question Answering (QA) benchmarks.
Researcher Affiliation | Collaboration | 1 The University of Hong Kong, 2 Huawei Noah's Ark Lab, 3 Hong Kong University of Science and Technology, 4 The Education University of Hong Kong, 5 Sun Yat-sen University, China
Pseudocode | Yes | Algorithm 1: OP-NAS Algorithm.
Open Source Code | No | The paper does not provide a specific link or explicit statement about the availability of the source code for the described methodology.
Open Datasets | Yes | For pretraining, we use the Books Corpus (Zhu et al. 2015) and English Wikipedia (Devlin et al. 2019). For finetuning and evaluation, we use the General Language Understanding Evaluation (GLUE) benchmark (Wang et al. 2018) and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016).
Dataset Splits | Yes | For finetuning and evaluation, we use the General Language Understanding Evaluation (GLUE) benchmark (Wang et al. 2018) and the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al. 2016). Unless stated otherwise, downstream tasks are reported using the same metrics as in BERT (Devlin et al. 2019). For other settings, we follow the BERT paper. In the NAS phase, we train each candidate architecture for 40,000 steps, which is then evaluated on the proxy task (GLUE).
Hardware Specification | Yes | The searching phase costs around 24K GPU hours (760+ candidates) on Nvidia V100.
Software Dependencies | No | The paper mentions common frameworks like PyTorch and CUDA but does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | We use Masked Language Model (MLM) and Next Sentence Prediction (NSP) as pretraining tasks. The whole process is divided into two phases, namely the NAS phase and the fully-train phase. For the NAS phase, we train the base model, whose configuration is the same as BERT-base (L = 12, H = 768, A = 12). The initial M is set to 100, and K is set to 5. Each parent mutates 5 child architectures. In the NAS phase, we train each candidate architecture for 40,000 steps...
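
The quoted setup outlines an evolutionary search: an initial pool of M = 100 architectures, K = 5 top-scoring candidates kept as parents, 5 mutated children per parent, and each candidate trained for 40,000 steps before being scored on the GLUE proxy task. The sketch below illustrates such a loop under those numbers; it is not the paper's released implementation. The operation vocabulary, the train_and_score callback, the number of search rounds, and the reading of M as the initial population size and K as the number of surviving parents are all assumptions here, and the paper's full OP-NAS procedure (Algorithm 1) includes mechanisms beyond this plain evolutionary loop.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Minimal sketch of the evolutionary loop implied by the quoted hyperparameters.
# All identifiers below (Candidate, PRIMITIVES, train_and_score, rounds) are
# illustrative placeholders, not names taken from the paper or its code.

PRIMITIVES = ["self_attention", "conv_3x1", "conv_5x1", "max_pool", "identity"]  # assumed op set


@dataclass
class Candidate:
    ops: List[str]                 # per-layer primitive operations (illustrative encoding)
    score: float = float("-inf")   # proxy-task (GLUE) score after short training


def random_candidate(num_layers: int = 12) -> Candidate:
    # 12 layers mirrors the BERT-base depth (L = 12) used as the base model.
    return Candidate(ops=[random.choice(PRIMITIVES) for _ in range(num_layers)])


def mutate(parent: Candidate) -> Candidate:
    # Replace the operation of one randomly chosen layer.
    child_ops = list(parent.ops)
    idx = random.randrange(len(child_ops))
    child_ops[idx] = random.choice(PRIMITIVES)
    return Candidate(ops=child_ops)


def evolutionary_search(
    train_and_score: Callable[[Candidate], float],  # trains ~40,000 steps, returns GLUE proxy score
    population_size: int = 100,                     # initial M from the quoted setup
    top_k: int = 5,                                 # K parents kept per round (assumed meaning of K)
    children_per_parent: int = 5,                   # each parent mutates 5 child architectures
    rounds: int = 10,                               # number of search rounds: assumed, not quoted
) -> Candidate:
    population = [random_candidate() for _ in range(population_size)]
    for cand in population:
        cand.score = train_and_score(cand)
    for _ in range(rounds):
        parents = sorted(population, key=lambda c: c.score, reverse=True)[:top_k]
        children = [mutate(p) for p in parents for _ in range(children_per_parent)]
        for child in children:
            child.score = train_and_score(child)
        population = parents + children             # survivors compete with their offspring
    return max(population, key=lambda c: c.score)


if __name__ == "__main__":
    # Dummy scorer for illustration; a real run would pretrain each candidate with
    # MLM + NSP for 40,000 steps and evaluate it on the GLUE proxy task.
    best = evolutionary_search(train_and_score=lambda cand: random.random())
    print(best.ops, best.score)
```

With K = 5 parents and 5 children each, a round in this sketch trains only 25 new candidates, so the dominant cost stays in the 40,000-step proxy-training runs rather than in the search bookkeeping.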