DrNAS: Dirichlet Neural Architecture Search

Authors: Xiangning Chen, Ruochen Wang, Minhao Cheng, Xiaocheng Tang, Cho-Jui Hsieh

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method. Specifically, we obtain a test error of 2.46% for CIFAR-10, 23.7% for ImageNet under the mobile setting. On NAS-Bench-201, we also achieve state-of-the-art results on all three datasets and provide insights for the effective design of neural architecture search algorithms.
Researcher Affiliation | Collaboration | Department of Computer Science, UCLA; DiDi AI Labs. {xiangning, chohsieh}@cs.ucla.edu; {ruocwang, mhcheng}@ucla.edu; xiaochengtang@didiglobal.com
Pseudocode | No | The paper describes its methods using mathematical formulations and textual descriptions but does not contain structured pseudocode or algorithm blocks (e.g., clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | Our code is available at https://github.com/xiangning-chen/DrNAS.
Open Datasets | Yes | We conduct extensive experiments on different datasets and search spaces to demonstrate DrNAS's effectiveness. Based on the DARTS search space (Liu et al., 2019), we achieve an average error rate of 2.46% on CIFAR-10... On NAS-Bench-201 (Dong & Yang, 2020), we also set new state-of-the-art results on all three datasets... NAS-Bench-201 provides support for 3 datasets (CIFAR-10, CIFAR-100, ImageNet-16-120 (Chrabaszcz et al., 2017)).
Dataset Splits | Yes | We equally divide the 50K training images into two parts: one is used for optimizing the network weights by momentum SGD and the other for learning the Dirichlet architecture distribution by an Adam optimizer.
Hardware Specification | Yes | In the first stage, we set the partial channel parameter K as 6 to fit the super-network into a single GTX 1080Ti GPU with 11GB memory, i.e., only 1/6 of the features are sampled on each edge.
Software Dependencies | No | The paper mentions optimizers (momentum SGD, Adam), learning rate schedulers (cosine annealing, warm-up), and regularization techniques (cutout, drop-path, label smoothing) but does not provide specific software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | Search settings: We equally divide the 50K training images into two parts: one is used for optimizing the network weights by momentum SGD and the other for learning the Dirichlet architecture distribution by an Adam optimizer. Since the Dirichlet concentration β must be positive, we apply the shifted exponential linear mapping β = ELU(η) + 1 and optimize over η instead. We use the l2 norm to constrain the distance between η and the anchor η̂ = 0. η is initialized by a standard Gaussian with scale 0.001, and λ in (2) is set to 0.001... Retrain settings: The evaluation phase uses the entire 50K training set to train the network from scratch for 600 epochs. The network weight is optimized by an SGD optimizer with a cosine annealing learning rate initialized as 0.025, a momentum of 0.9, and a weight decay of 3 × 10^-4. To allow a fair comparison with previous work, we also employ cutout regularization with length 16, drop-path (Zoph et al., 2018) with probability 0.3, and an auxiliary tower of weight 0.4.
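
The search-phase details quoted in the Dataset Splits and Experiment Setup rows (an even split of the 50K CIFAR-10 training images, momentum SGD for the network weights, Adam for the Dirichlet concentration β = ELU(η) + 1 with an l2 anchor penalty of weight λ = 0.001, and η initialized from a Gaussian of scale 0.001) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that): the super-network is replaced by a tiny stand-in, and the batch size and both learning rates are illustrative assumptions.

```python
# Minimal sketch of the DrNAS search-phase setup quoted above, NOT the
# authors' code: equal split of the CIFAR-10 training set, momentum SGD for
# network weights, Adam for the Dirichlet concentration beta = ELU(eta) + 1
# with an l2 anchor penalty lambda * ||eta - 0||^2. TinySuperNet, the batch
# size, and both learning rates are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
import torchvision
import torchvision.transforms as T

class TinySuperNet(torch.nn.Module):
    """Stand-in for the weight-sharing super-network (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Sequential(torch.nn.Flatten(),
                                        torch.nn.Linear(3 * 32 * 32, 10))
    def forward(self, x, arch_weights):
        # A real super-network mixes candidate ops on every edge using
        # arch_weights; the stub only scales features so gradients reach eta.
        return self.head(x) * arch_weights.mean()

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=T.ToTensor())
weight_set, arch_set = random_split(train_set, [25000, 25000])  # equal split of 50K
weight_loader = DataLoader(weight_set, batch_size=64, shuffle=True)
arch_loader = DataLoader(arch_set, batch_size=64, shuffle=True)

num_edges, num_ops = 14, 7                        # DARTS-like space (illustrative)
eta = torch.nn.Parameter(0.001 * torch.randn(num_edges, num_ops))  # Gaussian, scale 0.001
model = TinySuperNet()
lam = 0.001                                        # weight of the l2 anchor term

w_opt = torch.optim.SGD(model.parameters(), lr=0.025, momentum=0.9, weight_decay=3e-4)
a_opt = torch.optim.Adam([eta], lr=3e-4)           # Adam lr is an assumption

def concentration(eta):
    # Shifted ELU keeps every Dirichlet concentration strictly positive.
    return F.elu(eta) + 1.0

for (xw, yw), (xa, ya) in zip(weight_loader, arch_loader):
    # Architecture step: sample op weights from Dirichlet(beta), update eta.
    beta = concentration(eta)
    arch_w = torch.distributions.Dirichlet(beta).rsample()
    a_loss = F.cross_entropy(model(xa, arch_w), ya) + lam * eta.pow(2).sum()
    a_opt.zero_grad(); a_loss.backward(); a_opt.step()

    # Weight step: resample from the updated distribution, update the weights.
    beta = concentration(eta)
    arch_w = torch.distributions.Dirichlet(beta).rsample().detach()
    w_loss = F.cross_entropy(model(xw, arch_w), yw)
    w_opt.zero_grad(); w_loss.backward(); w_opt.step()
```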
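
The memory figure in the Hardware Specification row follows from partial channel connections: with K = 6, only 1/6 of the channels on each edge pass through the candidate operations while the rest bypass them. The sketch below illustrates that idea only; it is not the paper's implementation, `mixed_op` is a placeholder for the weighted sum of candidate operations, and the channel shuffle used in PC-DARTS-style code is omitted.

```python
# Sketch of partial-channel sampling with K = 6, used here only to illustrate
# the "1/6 of the features are sampled on each edge" statement.
import torch

def partial_channel_edge(x, mixed_op, K=6):
    """x: (N, C, H, W). Only C // K channels go through the candidate ops."""
    C = x.size(1)
    idx = torch.randperm(C)
    sampled, bypass = idx[: C // K], idx[C // K:]
    out_sampled = mixed_op(x[:, sampled])   # 1/K of the features are processed
    out_bypass = x[:, bypass]               # the rest bypass the candidate ops
    return torch.cat([out_sampled, out_bypass], dim=1)

# Shape check with an identity stand-in for the mixed operation.
x = torch.randn(2, 36, 8, 8)
print(partial_channel_edge(x, torch.nn.Identity()).shape)  # torch.Size([2, 36, 8, 8])
```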
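
The retraining recipe in the Experiment Setup row (600 epochs on the full training set, SGD with cosine annealing from 0.025, momentum 0.9, weight decay 3 × 10^-4, cutout of length 16, drop-path 0.3, auxiliary tower weight 0.4) maps onto standard PyTorch components. The sketch below only wires up the optimizer, schedule, and augmentation under stated assumptions: the evaluated network is a placeholder, the crop/flip augmentation and normalization statistics are conventional CIFAR-10 choices rather than values quoted from the paper, and cutout is approximated with torchvision's RandomErasing.

```python
# Sketch of the retraining hyperparameters quoted above, not the authors'
# script: SGD + cosine annealing over 600 epochs, weight decay 3e-4, and a
# cutout-like 16x16 erasure. `net` is a placeholder for the discretized
# architecture; drop-path (p = 0.3) and the auxiliary tower are only noted.
import torch
import torchvision.transforms as T

EPOCHS = 600
net = torch.nn.Sequential(torch.nn.Flatten(),
                          torch.nn.Linear(3 * 32 * 32, 10))  # placeholder network

optimizer = torch.optim.SGD(net.parameters(), lr=0.025, momentum=0.9,
                            weight_decay=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
# scheduler.step() would be called once per epoch during training.

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),        # conventional CIFAR-10 augmentation (assumption)
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    # Cutout with length 16 approximated by erasing a 16x16 square
    # (1/4 of the 32x32 image area) at a fixed square aspect ratio.
    T.RandomErasing(p=1.0, scale=(0.25, 0.25), ratio=(1.0, 1.0), value=0),
])

# Per batch, the total loss would combine the main head with the auxiliary
# tower, e.g. loss = ce(logits, y) + 0.4 * ce(logits_aux, y), and drop-path
# with probability 0.3 would be applied inside the cells.
```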