Analyzing and Mitigating Interference in Neural Architecture Search

Authors: Jin Xu, Xu Tan, Kaitao Song, Renqian Luo, Yichong Leng, Tao Qin, Tie-Yan Liu, Jian Li

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on a BERT search space verify that mitigating interference via each of our proposed methods improves the rank correlation of super-net and combining both methods can achieve better results. Our discovered architecture outperforms RoBERTa-base by 1.1 and 0.6 points and ELECTRA-base by 1.6 and 1.1 points on the dev and test set of GLUE benchmark. Extensive results on the BERT compression, reading comprehension and ImageNet task demonstrate the effectiveness and generality of our proposed methods.
Researcher Affiliation | Collaboration | 1 Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University; 2 Microsoft Research Asia; 3 University of Science and Technology of China. Correspondence to: Xu Tan <xuta@microsoft.com>, Jian Li <lijian83@mail.tsinghua.edu.cn>.
Pseudocode | No | The training procedure of MAGIC-A at each step is as follows: obtain a batch of data and an anchor child model α_l, and randomly sample a child model α_t; calculate the loss according to Eq. (5) and update the weights of α_t; replace α_l with α_t if Val(α_t) > Val(α_l), where Val(·) is the accuracy obtained from the dev set. (A minimal code sketch of this procedure is given after the table.)
Open Source Code | No | The paper does not contain any statement about releasing source code or a link to a code repository for their methods.
Open Datasets | Yes | Following BERT (Devlin et al., 2019), we train the super-net and discover architectures using BookCorpus plus English Wikipedia (16GB in total). ... We evaluate performance by fine-tuning pre-trained models on the GLUE benchmark (Wang et al., 2019) ... We further evaluate the generalizability of our searched architecture by fine-tuning it on the reading comprehension tasks SQuAD v1.1 (Rajpurkar et al., 2016) and SQuAD v2.0 (Rajpurkar et al., 2018). ... We use a MobileNet-v2 (Sandler et al., 2018) based search space following ProxylessNAS (Cai et al., 2018).
Dataset Splits | Yes | Our discovered architecture outperforms RoBERTa-base by 1.1 and 0.6 points and ELECTRA-base by 1.6 and 1.1 points on the dev and test set of GLUE benchmark. ... Replace α_l with α_t if Val(α_t) > Val(α_l), where Val(·) is the accuracy obtained from the dev set.
Hardware Specification | Yes | We train an N = 12 layer super-net using a batch of 1024 sentences on 32 NVIDIA P40 GPUs until 62,500 steps. ... For super-net training, we use the SGD optimizer with an initial learning rate of 0.4 and a cosine learning rate, and train the super-net on 8 V100 GPUs for 150 epochs with a batch size of 512.
Software Dependencies | No | Our experiments are implemented with fairseq codebase (Ott et al., 2019).
Experiment Setup | Yes | We use Adam (Kingma & Ba, 2015) with a learning rate of 1e-4, β1 = 0.9 and β2 = 0.999. The peak learning rate is 5e-4 with a warmup step of 10,000 followed by linear annealing. The dropout rate is 0.1 and the weight decay is 0.01. We set the max length of sentences as 128 tokens. The super-net is trained with the batch size of 1024 sentences for 250,000 steps. (An illustrative optimizer and schedule sketch also follows the table.)
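
The Pseudocode row above describes the per-step MAGIC-A procedure in prose only. The following Python sketch restates those three steps for clarity; it is not the authors' implementation. The names supernet, sample_child, evaluate_on_dev, and the supernet(inputs, arch=...) interface are hypothetical placeholders, and plain cross-entropy stands in for the loss of Eq. (5), which is not reproduced in the excerpt above.

import torch.nn.functional as F

def magic_a_step(supernet, sample_child, anchor, batch, optimizer, evaluate_on_dev):
    # One MAGIC-A training step, following the three-step description above.

    # 1) Randomly sample a child architecture alpha_t from the super-net.
    alpha_t = sample_child(supernet)

    # 2) Compute the training loss and update the weights activated by alpha_t.
    #    Cross-entropy is only a placeholder for Eq. (5) of the paper.
    inputs, targets = batch
    logits = supernet(inputs, arch=alpha_t)  # hypothetical child-model forward pass
    loss = F.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) Replace the anchor alpha_l with alpha_t if alpha_t scores higher on the dev set.
    if evaluate_on_dev(supernet, alpha_t) > evaluate_on_dev(supernet, anchor):
        anchor = alpha_t
    return anchor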
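
The Experiment Setup row likewise lists the pre-training hyperparameters only as prose. Below is a minimal sketch of that optimizer and learning-rate schedule, assuming standard PyTorch components rather than the authors' fairseq configuration, and using the quoted peak learning rate of 5e-4, 10,000 warmup steps, linear annealing over 250,000 total steps, β1 = 0.9, β2 = 0.999, and weight decay 0.01; the model here is a placeholder.

import torch

model = torch.nn.Linear(128, 128)  # placeholder for the actual super-net
peak_lr, warmup_steps, total_steps = 5e-4, 10_000, 250_000

optimizer = torch.optim.Adam(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.999), weight_decay=0.01
)

def lr_lambda(step):
    # Linear warmup to the peak learning rate, then linear annealing toward zero.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Usage per training step: optimizer.step() followed by scheduler.step().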