Analyzing and Mitigating Interference in Neural Architecture Search

Authors: Jin Xu, Xu Tan, Kaitao Song, Renqian Luo, Yichong Leng, Tao Qin, Tie-Yan Liu, Jian Li

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a BERT search space verify that mitigating interference via each of our proposed methods improves the rank correlation of super-net and combining both methods can achieve better results. Our discovered architecture outperforms RoBERTa-base by 1.1 and 0.6 points and ELECTRA-base by 1.6 and 1.1 points on the dev and test set of GLUE benchmark. Extensive results on the BERT compression, reading comprehension and ImageNet task demonstrate the effectiveness and generality of our proposed methods.
Researcher Affiliation | Collaboration | 1 Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; 2 Microsoft Research Asia; 3 University of Science and Technology of China. Correspondence to: Xu Tan <xuta@microsoft.com>, Jian Li <lijian83@mail.tsinghua.edu.cn>.
Pseudocode | No | The training procedure of MAGIC-A at each step is as follows: obtain a batch of data and an anchor child model α_l, and randomly sample a child model α_t; calculate the loss according to Eq. (5) and update the weights of α_t; replace α_l with α_t if Val(α_t) > Val(α_l), where Val(·) is the accuracy obtained from the dev set. (A minimal code sketch of this procedure is given after the table.)
Open Source Code | No | The paper does not contain any statement about releasing source code or a link to a code repository for their methods.
Open Datasets | Yes | Following BERT (Devlin et al., 2019), we train the super-net and discover architectures using BookCorpus plus English Wikipedia (16GB in total). ... We evaluate performance by fine-tuning pre-trained models on the GLUE benchmark (Wang et al., 2019) ... We further evaluate the generalizability of our searched architecture by fine-tuning it on the reading comprehension tasks SQuAD v1.1 (Rajpurkar et al., 2016) and SQuAD v2.0 (Rajpurkar et al., 2018). ... We use a MobileNet-v2 (Sandler et al., 2018) based search space following ProxylessNAS (Cai et al., 2018).
Dataset Splits | Yes | Our discovered architecture outperforms RoBERTa-base by 1.1 and 0.6 points and ELECTRA-base by 1.6 and 1.1 points on the dev and test set of GLUE benchmark. ... Replace α_l with α_t if Val(α_t) > Val(α_l), where Val(·) is the accuracy obtained from the dev set.
Hardware Specification | Yes | We train an N = 12 layer super-net using a batch of 1024 sentences on 32 NVIDIA P40 GPUs until 62,500 steps. ... For super-net training, we use the SGD optimizer with an initial learning rate of 0.4 and a cosine learning rate, and train the super-net on 8 V100 GPUs for 150 epochs with a batch size of 512.
Software Dependencies | No | Our experiments are implemented with fairseq codebase (Ott et al., 2019).
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2015) with a learning rate of 1e-4, β1 = 0.9 and β2 = 0.999. The peak learning rate is 5e-4 with a warmup step of 10,000 followed by linear annealing. The dropout rate is 0.1 and the weight decay is 0.01. We set the max length of sentences as 128 tokens. The super-net is trained with the batch size of 1024 sentences for 250,000 steps. (An illustrative optimizer and schedule sketch also follows the table.)
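
The Pseudocode row above describes the per-step MAGIC-A procedure in prose only. The following Python sketch restates those three steps for clarity; it is not the authors' implementation. The names supernet, sample_child, evaluate_on_dev, and the supernet(inputs, arch=...) interface are hypothetical placeholders, and plain cross-entropy stands in for the loss of Eq. (5), which is not reproduced in the excerpt above.

import torch.nn.functional as F

def magic_a_step(supernet, sample_child, anchor, batch, optimizer, evaluate_on_dev):
    # One MAGIC-A training step, following the three-step description above.

    # 1) Randomly sample a child architecture alpha_t from the super-net.
    alpha_t = sample_child(supernet)

    # 2) Compute the training loss and update the weights activated by alpha_t.
    #    Cross-entropy is only a placeholder for Eq. (5) of the paper.
    inputs, targets = batch
    logits = supernet(inputs, arch=alpha_t)  # hypothetical child-model forward pass
    loss = F.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) Replace the anchor alpha_l with alpha_t if alpha_t scores higher on the dev set.
    if evaluate_on_dev(supernet, alpha_t) > evaluate_on_dev(supernet, anchor):
        anchor = alpha_t
    return anchor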
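
The Experiment Setup row likewise lists the pre-training hyperparameters only as prose. Below is a minimal sketch of that optimizer and learning-rate schedule, assuming standard PyTorch components rather than the authors' fairseq configuration, and using the quoted peak learning rate of 5e-4, 10,000 warmup steps, linear annealing over 250,000 total steps, β1 = 0.9, β2 = 0.999, and weight decay 0.01; the model here is a placeholder.

import torch

model = torch.nn.Linear(128, 128)  # placeholder for the actual super-net
peak_lr, warmup_steps, total_steps = 5e-4, 10_000, 250_000

optimizer = torch.optim.Adam(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.999), weight_decay=0.01
)

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then linear annealing toward zero.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage per training step: optimizer.step() followed by scheduler.step().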