Operation-Level Early Stopping for Robustifying Differentiable NAS

Authors: Shen Jiang, Zipeng Ji, Guanghui Zhu, Chunfeng Yuan, Yihua Huang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental results can verify our hypothesis and the effectiveness of OLES. OLES achieves a top-1 test error of 2.30% on CIFAR-10 with nearly the same search cost as DARTS (0.4 GPU-days). We conduct NAS on CIFAR-10 using the DARTS search space and then transfer the searched architectures to CIFAR-100."
Researcher Affiliation | Academia | "State Key Laboratory for Novel Software Technology, Nanjing University; {jiangshen, jizipeng}@smail.nju.edu.cn, {zgh, cfyuan, yhuang}@nju.edu.cn"
Pseudocode | Yes | "The algorithm for DARTS with operation-level early stopping can be found in Algorithm 1." A sketch of this control logic follows the table.
Open Source Code | Yes | "Open-source code can be found at https://github.com/PasaLab/oles."
Open Datasets | Yes | "We conduct NAS on CIFAR-10 using the DARTS search space and then transfer the searched architectures to CIFAR-100. To verify the transferability of architectures discovered by OLES, we transfer the best architecture derived from CIFAR-10 to ImageNet. NAS-Bench-201 [Dong and Yang, 2020] provides a unified benchmark for analyzing various up-to-date NAS algorithms. It contains 4 internal nodes with 5 operations (i.e., Zero, Skip Connection, 1×1 Conv, 3×3 Conv, 3×3 Avg Pool). NAS-Bench-201 offers a similar cell-based search space that comprises a total of 15625 unique architectures. The architectures are trained on three datasets (i.e., CIFAR-10, CIFAR-100, and ImageNet16-120)."
Dataset Splits | Yes | "Meanwhile, the architecture parameters are trained on the validation data, causing the advantages of parametric operations to weaken progressively and ultimately leading to the domination of parameter-free operations. $\mathrm{GM}(g^o_{\text{train}}, g^o_{\text{val}}) = \frac{\sum_{m=1}^{M} \mathrm{COS}(g^o_{\text{train}}, g^o_{\text{val}})}{M}$, where $\mathrm{COS}(g^o_{\text{train}}, g^o_{\text{val}}) = \frac{g^o_{\text{train}} \cdot g^o_{\text{val}}}{|g^o_{\text{train}}|\,|g^o_{\text{val}}|}$, $g^o_{\text{train}} = \frac{\partial \mathcal{L}_{\text{train}}(w, \alpha, B_{\text{train}})}{\partial o}$, $g^o_{\text{val}} = \frac{\partial \mathcal{L}_{\text{val}}(w, \alpha, B_{\text{val}})}{\partial o}$, where $B_{\text{train}}$ and $B_{\text{val}}$ denote batches of training data and validation data, respectively." A hedged sketch of this computation follows the table.
Hardware Specification | No | The paper mentions "GPU-days" as a metric for search cost (e.g., "0.4 GPU-days") but does not specify any particular GPU models, CPU models, or other hardware components used for the experiments.
Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers. While common dependencies such as PyTorch and CUDA are implicitly used, their versions are not specified.
Experiment Setup | Yes | "We keep all experimental settings the same as the original in each search space and only use first-order optimization, as our method only needs to freeze the operation parameters during supernet training. In addition to the standard hyperparameters in DARTS, our method introduces one additional hyperparameter, the overfitting threshold ξ, to stop operation training. We determine the threshold ξ by averaging the cosine similarity over 20 iterations for 30 randomly initialized architectures in each search space. The gradient matching (GM) score is dynamically computed by averaging over every 20 iterations throughout the entire training process to reduce variance. The threshold ξ is set to 0.3." Both the windowed averaging and the threshold estimation are sketched after the table.
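
For readers checking the Dataset Splits row, the following is a minimal PyTorch-style sketch of the gradient-matching (GM) computation quoted above: for each candidate operation, the cosine similarity between the gradients of its weights on a training batch and on a validation batch. The `supernet.operations()` accessor (mapping operation names to their parameter lists) and the plain `supernet(x)` forward call are illustrative assumptions, not the interface of the released code.

```python
import torch
import torch.nn.functional as F

def gm_scores(supernet, criterion, train_batch, val_batch):
    """Sketch of the GM score: per-operation cosine similarity between the
    gradients of the operation's weights on a training batch and on a
    validation batch. Assumes every operation's weights receive a gradient."""
    x_t, y_t = train_batch
    x_v, y_v = val_batch

    # Gradients of the training-batch loss w.r.t. every operation's weights.
    supernet.zero_grad()
    criterion(supernet(x_t), y_t).backward()
    g_train = {name: torch.cat([p.grad.flatten() for p in params])
               for name, params in supernet.operations().items()}

    # Gradients of the validation-batch loss w.r.t. the same weights.
    supernet.zero_grad()
    criterion(supernet(x_v), y_v).backward()
    g_val = {name: torch.cat([p.grad.flatten() for p in params])
             for name, params in supernet.operations().items()}

    # COS(g_train, g_val) for each operation; averaging over M batches
    # (the outer sum in the paper's formula) is done by the caller.
    return {name: F.cosine_similarity(g_train[name], g_val[name], dim=0).item()
            for name in g_train}
```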
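
The Pseudocode and Experiment Setup rows describe how this score drives operation-level early stopping: the GM score is averaged over a window of 20 iterations, and an operation's weights are frozen once the windowed average falls below the threshold ξ = 0.3. A minimal sketch of that control logic, assuming the same `supernet.operations()` accessor as above, is shown below; it is not a transcription of Algorithm 1.

```python
from collections import defaultdict, deque

XI = 0.3        # overfitting threshold from the Experiment Setup row
WINDOW = 20     # number of iterations the GM score is averaged over

gm_history = defaultdict(lambda: deque(maxlen=WINDOW))
frozen_ops = set()

def update_early_stopping(supernet, step_scores):
    """step_scores: {operation name: GM score for this iteration}, e.g. the
    output of gm_scores() above. Once the windowed average drops below XI,
    the operation's weights are frozen (its training stops), while the
    architecture parameters keep being updated as usual."""
    for name, score in step_scores.items():
        if name in frozen_ops:
            continue
        gm_history[name].append(score)
        window = gm_history[name]
        if len(window) == WINDOW and sum(window) / WINDOW < XI:
            for p in supernet.operations()[name]:
                p.requires_grad_(False)   # stop updating this operation's weights
            frozen_ops.add(name)          # exclude it from later GM computations
```

In a DARTS-style search loop, `gm_scores` would be evaluated each iteration and `update_early_stopping` applied before the next weight update; the architecture parameters α continue to be optimized on the validation data as usual.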
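
Finally, the Experiment Setup row states that ξ itself is determined by averaging the cosine similarity over 20 iterations for 30 randomly initialized architectures. A hedged sketch of that procedure, reusing `gm_scores` from the first snippet and assuming hypothetical `sample_supernet()` and `make_batches()` helpers, might look like this:

```python
import statistics
import torch.nn as nn

def estimate_threshold(sample_supernet, make_batches, num_archs=30, iters=20):
    """Sketch of the threshold estimation described in the Experiment Setup row:
    average the GM cosine similarity over 20 iterations for 30 randomly
    initialized supernets. `sample_supernet()` returns a freshly initialized
    supernet and `make_batches()` a (train_batch, val_batch) pair; both are
    hypothetical helpers."""
    criterion = nn.CrossEntropyLoss()
    per_arch = []
    for _ in range(num_archs):
        supernet = sample_supernet()       # fresh random initialization
        step_means = []
        for _ in range(iters):
            # In practice the supernet weights would also be trained between
            # these iterations; omitted here to keep the sketch short.
            train_batch, val_batch = make_batches()
            scores = gm_scores(supernet, criterion, train_batch, val_batch)
            step_means.append(statistics.mean(scores.values()))
        per_arch.append(statistics.mean(step_means))
    return statistics.mean(per_arch)       # the paper sets xi = 0.3 in practice
```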