Operation-Level Early Stopping for Robustifying Differentiable NAS
Authors: Shen Jiang, Zipeng Ji, Guanghui Zhu, Chunfeng Yuan, Yihua Huang
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results can verify our hypothesis and the effectiveness of OLES. OLES achieves a top-1 test error of 2.30% on CIFAR-10 with nearly the same search cost as DARTS of 0.4 GPU-days. We conduct NAS on CIFAR-10 using DARTS search space and then transfer the searched architectures to CIFAR-100. |
| Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Nanjing University {jiangshen, jizipeng}@smail.nju.edu.cn, {zgh, cfyuan, yhuang}@nju.edu.cn |
| Pseudocode | Yes | The algorithm for DARTS with operation-level early stopping can be found in Algorithm 1. |
| Open Source Code | Yes | Open-source code can be found at https://github.com/PasaLab/oles. |
| Open Datasets | Yes | We conduct NAS on CIFAR-10 using the DARTS search space and then transfer the searched architectures to CIFAR-100. To verify the transferability of architectures discovered by OLES, we transfer the best architecture derived from CIFAR-10 to ImageNet. NAS-Bench-201 Dong and Yang [2020] provides a unified benchmark for analyzing various up-to-date NAS algorithms. It contains 4 internal nodes with 5 operations (i.e., Zero, Skip Connection, 1×1 Conv, 3×3 Conv, 3×3 Avg Pool). NAS-Bench-201 offers a similar cell-based search space that comprises a total of 15625 unique architectures. The architectures are trained on three datasets (i.e., CIFAR-10, CIFAR-100, and ImageNet16-120). |
| Dataset Splits | Yes | Meanwhile, the architecture parameters are trained on the validation data, causing the advantages of parametric operations to weaken progressively and ultimately leading to the domination of parameter-free operations. $\mathrm{GM}(g_o^{train}, g_o^{val}) = \frac{1}{M}\sum_{m=1}^{M} \mathrm{COS}(g_o^{train}, g_o^{val})$, where $\mathrm{COS}(g_o^{train}, g_o^{val}) = \frac{g_o^{train} \cdot g_o^{val}}{\lvert g_o^{train}\rvert\,\lvert g_o^{val}\rvert}$, $g_o^{train} = \frac{\partial L_{train}(w, \alpha, B_{train})}{\partial o}$, and $g_o^{val} = \frac{\partial L_{val}(w, \alpha, B_{val})}{\partial o}$, where $B_{train}$ and $B_{val}$ denote batches of training data and validation data, respectively. (A minimal sketch of this computation follows the table.) |
| Hardware Specification | No | The paper mentions “GPU-days” as a metric for search cost (e.g., “0.4 GPU-days”) but does not specify any particular GPU models, CPU models, or other hardware components used for the experiments. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers. While a deep learning framework such as PyTorch (and a corresponding CUDA toolkit) is implicitly required, no versions are specified. |
| Experiment Setup | Yes | We keep all experimental settings the same as the original in each search space and only use first-order optimization, as our method only needs to freeze the operation parameters during supernet training. In addition to the standard hyperparameters in DARTS, our method introduces an additional hyperparameter, i.e., the overfitting threshold ξ, to stop operation training. We determine the threshold ξ by averaging the cosine similarity over 20 iterations for 30 randomly initialized architectures in each search space. The gradient matching (GM) score is dynamically computed by averaging over every 20 iterations throughout the entire training process to reduce variance. The threshold ξ is set to 0.3. (See the early-stopping sketch after the table.) |
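For concreteness, below is a minimal PyTorch-style sketch of the per-batch gradient-matching term behind the GM score quoted in the Dataset Splits row. It is an illustration based on the quoted formula, not the authors' released implementation: `op` is assumed to be the `nn.Module` of a single candidate operation, and `loss_train` / `loss_val` are the supernet losses on one training batch and one validation batch.

```python
import torch


def cosine_similarity(g_train, g_val):
    """Cosine of the angle between two flattened gradient vectors."""
    return torch.dot(g_train, g_val) / (g_train.norm() * g_val.norm() + 1e-12)


def gm_score(op, loss_train, loss_val):
    """Per-batch GM term for one candidate operation `op` (an nn.Module):
    cosine similarity between the gradients of the training and validation
    losses with respect to the operation's parameters."""
    params = [p for p in op.parameters() if p.requires_grad]
    g_train = torch.cat([g.flatten() for g in
                         torch.autograd.grad(loss_train, params, retain_graph=True)])
    g_val = torch.cat([g.flatten() for g in
                       torch.autograd.grad(loss_val, params, retain_graph=True)])
    return cosine_similarity(g_train, g_val).item()
```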
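Similarly, a hedged sketch of the operation-level early-stopping rule described in the Experiment Setup row: per-batch cosine similarities (reusing `gm_score` from the sketch above) are averaged over a 20-iteration window, and an operation's weights are frozen once its windowed GM score falls below the threshold ξ = 0.3. The `candidate_ops` mapping and freezing via `requires_grad_` are assumptions chosen for illustration, not the paper's reference code.

```python
from collections import defaultdict, deque

XI = 0.3      # overfitting threshold xi reported in the paper
WINDOW = 20   # number of iterations the GM score is averaged over

gm_history = defaultdict(lambda: deque(maxlen=WINDOW))  # per-operation score buffer
frozen = set()                                          # operations already stopped


def apply_operation_level_early_stopping(candidate_ops, loss_train, loss_val):
    """Freeze any candidate operation whose windowed GM score drops below XI.

    `candidate_ops`: dict mapping an operation name to its nn.Module
    (hypothetical interface to the supernet's mixed operations).
    """
    for name, op in candidate_ops.items():
        if name in frozen:
            continue
        gm_history[name].append(gm_score(op, loss_train, loss_val))
        if len(gm_history[name]) == WINDOW:
            windowed_gm = sum(gm_history[name]) / WINDOW
            if windowed_gm < XI:
                frozen.add(name)
                for p in op.parameters():
                    p.requires_grad_(False)  # stop updating this operation's weights
```

In an actual DARTS-style training loop, a check like this would run after each supernet weight update, so frozen operations still participate in the forward pass but their parameters no longer change.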