DrNAS: Dirichlet Neural Architecture Search
Authors: Xiangning Chen, Ruochen Wang, Minhao Cheng, Xiaocheng Tang, Cho-Jui Hsieh
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our method. Specifically, we obtain a test error of 2.46% for CIFAR-10, 23.7% for ImageNet under the mobile setting. On NAS-Bench-201, we also achieve state-of-the-art results on all three datasets and provide insights for the effective design of neural architecture search algorithms. |
| Researcher Affiliation | Collaboration | ¹Department of Computer Science, UCLA; ²DiDi AI Labs. {xiangning, chohsieh}@cs.ucla.edu, {ruocwang, mhcheng}@ucla.edu, xiaochengtang@didiglobal.com |
| Pseudocode | No | The paper describes its methods using mathematical formulations and textual descriptions but does not contain structured pseudocode or algorithm blocks (e.g., clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | Our code is available at https://github.com/xiangning-chen/DrNAS. |
| Open Datasets | Yes | We conduct extensive experiments on different datasets and search spaces to demonstrate DrNAS's effectiveness. Based on the DARTS search space (Liu et al., 2019), we achieve an average error rate of 2.46% on CIFAR-10... On NAS-Bench-201 (Dong & Yang, 2020), we also set new state-of-the-art results on all three datasets... NAS-Bench-201 provides support for 3 datasets (CIFAR-10, CIFAR-100, ImageNet-16-120 (Chrabaszcz et al., 2017)). |
| Dataset Splits | Yes | We equally divide the 50K training images into two parts, one is used for optimizing the network weights by momentum SGD and the other for learning the Dirichlet architecture distribution by an Adam optimizer. |
| Hardware Specification | Yes | In the first stage, we set the partial channel parameter K as 6 to fit the super-network into a single GTX 1080Ti GPU with 11GB memory, i.e., only 1/6 features are sampled on each edge. |
| Software Dependencies | No | The paper mentions optimizers (momentum SGD, Adam), learning rate schedulers (cosine annealing, warm-up), and regularization techniques (cutout, drop-path, label smoothing) but does not provide specific software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | Search Settings: We equally divide the 50K training images into two parts, one is used for optimizing the network weights by momentum SGD and the other for learning the Dirichlet architecture distribution by an Adam optimizer. Since the Dirichlet concentration β must be positive, we apply the shifted exponential linear mapping β = ELU(η) + 1 and optimize over η instead. We use the ℓ2 norm to constrain the distance between η and the anchor η̂ = 0. η is initialized by a standard Gaussian with scale 0.001, and λ in (2) is set to 0.001... Retrain Settings: The evaluation phase uses the entire 50K training set to train the network from scratch for 600 epochs. The network weight is optimized by an SGD optimizer with a cosine annealing learning rate initialized as 0.025, a momentum of 0.9, and a weight decay of 3×10⁻⁴. To allow a fair comparison with previous work, we also employ cutout regularization with length 16, drop-path (Zoph et al., 2018) with probability 0.3, and an auxiliary tower of weight 0.4. (Both the search and retrain settings are sketched in code after this table.) |
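
The Dataset Splits and Experiment Setup rows describe the search phase: the 50K CIFAR-10 training images are split in half between a momentum-SGD optimizer for the super-network weights and an Adam optimizer for the Dirichlet concentration, which is kept positive through the shifted mapping β = ELU(η) + 1 and tied to the anchor η̂ = 0 by an ℓ2 penalty with λ = 0.001. The following is a minimal PyTorch-style sketch of that parameterization, not the authors' implementation; the class name `DirichletArchitecture`, the edge/operation counts, and the Adam learning rate are illustrative assumptions, while the init scale 0.001, λ = 0.001, and the 50/50 split are taken from the quotes above.

```python
# Minimal sketch of the DrNAS search-phase setup described above (assumes PyTorch).
# Only the values quoted in the table (init scale 0.001, lambda = 0.001,
# 50/50 data split) come from the paper; everything else is illustrative.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import SubsetRandomSampler


class DirichletArchitecture(nn.Module):
    """Per-edge Dirichlet concentrations, beta = ELU(eta) + 1 > 0."""

    def __init__(self, num_edges, num_ops, init_scale=1e-3, lam=1e-3):
        super().__init__()
        # eta is initialized from a standard Gaussian scaled by 0.001.
        self.eta = nn.Parameter(init_scale * torch.randn(num_edges, num_ops))
        self.lam = lam

    def concentration(self):
        # Shifted ELU keeps every concentration parameter strictly positive.
        return F.elu(self.eta) + 1.0

    def sample_mixing_weights(self):
        # Reparameterized sample of the per-edge operation mixing weights.
        return torch.distributions.Dirichlet(self.concentration()).rsample()

    def anchor_penalty(self):
        # l2 distance between eta and the anchor eta_hat = 0, scaled by lambda.
        return self.lam * self.eta.pow(2).sum()


# 50/50 split of the 50K CIFAR-10 training images: one half updates the
# super-network weights (momentum SGD), the other the Dirichlet parameters (Adam).
indices = np.random.permutation(50_000).tolist()
weight_sampler = SubsetRandomSampler(indices[:25_000])
arch_sampler = SubsetRandomSampler(indices[25_000:])

arch = DirichletArchitecture(num_edges=14, num_ops=8)  # typical DARTS-cell sizes (assumption)
arch_optimizer = torch.optim.Adam(arch.parameters(), lr=3e-4)  # lr is an assumption
# weight_optimizer = torch.optim.SGD(supernet.parameters(), ...)  # super-network omitted here
```

In each search step, one would sample mixing weights from the Dirichlet, compute the training loss on the weight half of the data for the SGD update, and the loss plus `anchor_penalty()` on the other half for the Adam update on η.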
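The retrain settings quoted in the same row translate almost directly into an optimizer and scheduler configuration. The sketch below, again assuming PyTorch, shows only that schedule: `nn.Linear` stands in for the actual DARTS-style evaluation network, the training pass is left as a comment, and the linear annealing of the drop-path probability is a common DARTS-era convention assumed here rather than a value quoted above.

```python
# Runnable sketch of the retraining schedule quoted above: 600 epochs, SGD with
# a cosine-annealed learning rate starting at 0.025, momentum 0.9, weight decay 3e-4.
import torch
import torch.nn as nn

EPOCHS = 600
model = nn.Linear(8, 8)  # placeholder for the evaluation network

optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                            momentum=0.9, weight_decay=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    # Drop-path probability 0.3; linear annealing over epochs is an assumption,
    # not a detail quoted in the table.
    drop_path_prob = 0.3 * epoch / EPOCHS
    # One pass over the full 50K CIFAR-10 training set would go here, with
    # cutout (length 16) in the data pipeline and the auxiliary-tower loss
    # added to the main loss with weight 0.4.
    scheduler.step()
```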