ZiCo: Zero-shot NAS via Inverse Coefficient of Variation on Gradients
Authors: Guihong Li, Yuedong Yang, Kartikeya Bhardwaj, Radu Marculescu
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS) for multiple applications (e.g., image classification/reconstruction and pixel-level prediction). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs, respectively, on ImageNet within 0.4 GPU days. |
| Researcher Affiliation | Collaboration | 1The University of Texas at Austin, 2Qualcomm {lgh,albertyoung,radum}@utexas.edu, kbhardwa@qti.qualcomm.com |
| Pseudocode | Yes | Algorithm 1: ZiCo-based zero-shot NAS framework (a hedged sketch of the proxy computation appears after this table) |
| Open Source Code | Yes | Our code is available at https://github.com/SLDGroup/ZiCo. |
| Open Datasets | Yes | For experiments (i), to validate Theorem 3.1, we optimize a linear model as in Eq. 2 on the MNIST dataset and report the mean gradient values and the standard deviation vs. the total training loss. Moreover, we also optimize the model defined by Eq. 7 on MNIST and report the training loss vs. the standard deviation in order to validate Theorem 3.2 and Theorem 3.5. For experiments (ii), we compare our proposed ZiCo against existing proxies on three mainstream NAS benchmarks: NATSBench is a popular cell-based benchmark with two different search spaces: (1) NATSBench-TSS consists of 15625 total architectures with different cell structures trained on the CIFAR10, CIFAR100, and ImageNet16-120 (Img16-120) datasets; it is NASBench-201 renamed (Dong & Yang, 2020); (2) NATSBench-SSS includes 32768 architectures (which differ only in the width values of each layer) and is also trained on the same three datasets (Dong et al., 2021). NASBench101 provides users with 423k neural architectures and their test accuracy on the CIFAR10 dataset; the architectures are built by stacking the same cell multiple times (Ying et al., 2019). TransNASBench101-Micro contains 4096 networks with different cell structures on various downstream applications (see Appendix E.2) (Duan et al., 2021). |
| Dataset Splits | Yes | NATSBench-TSS consists of 15625 total architectures with different cell structures trained on the CIFAR10, CIFAR100, and ImageNet16-120 (Img16-120) datasets; it is NASBench-201 renamed (Dong & Yang, 2020); (2) NATSBench-SSS includes 32768 architectures (which differ only in the width values of each layer) and is also trained on the same three datasets (Dong et al., 2021). NASBench101 provides users with 423k neural architectures and their test accuracy on the CIFAR10 dataset; the architectures are built by stacking the same cell multiple times (Ying et al., 2019). |
| Hardware Specification | Yes | We conduct the search for 100k steps; this takes 10 hours on a single NVIDIA 3090 GPU (i.e., 0.4 GPU days). Then, we train the obtained network with the exact same training setup as Lin et al. (2021). Specifically, we train the neural network for 480 epochs with batch size 512 and input resolution 224. We also use a distillation-based training loss by taking EfficientNet-B3 as the teacher. Finally, we set the initial learning rate to 0.1 with a cosine annealing scheduling scheme. |
| Software Dependencies | No | The paper does not explicitly provide version numbers for ancillary software dependencies such as libraries or frameworks (e.g., PyTorch, TensorFlow). It mentions optimizers like SGD and activation functions like ReLU, but no versioned software packages. |
| Experiment Setup | Yes | We use the SGD optimizer with momentum 0.9 and weight decay 4e-5. We set the initial learning rate to 0.1 and use a cosine annealing scheme to adjust the learning rate during training. We train the obtained network for 480 epochs with batch size 512 and input resolution 224. (A hedged configuration sketch follows the table.) |
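
The pseudocode row above names Algorithm 1, the ZiCo-based zero-shot NAS framework. The snippet below is a minimal, hypothetical sketch of how the ZiCo proxy itself, the per-layer inverse coefficient of variation (mean/std) of gradient magnitudes aggregated across layers in log space, could be computed in PyTorch. The function name `zico_proxy`, the weight-only parameter filter, and the assumption of at least two mini-batches are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn

def zico_proxy(model: nn.Module, loss_fn, batches, device="cpu"):
    """Score a randomly initialized model from its gradients over a few mini-batches."""
    model = model.to(device).train()
    abs_grads = {}  # parameter name -> list of per-batch |gradient| tensors

    for inputs, targets in batches:  # assumes at least two (inputs, targets) batches
        model.zero_grad()
        loss = loss_fn(model(inputs.to(device)), targets.to(device))
        loss.backward()
        for name, param in model.named_parameters():
            if param.grad is not None and "weight" in name:
                abs_grads.setdefault(name, []).append(param.grad.detach().abs().clone())

    score = 0.0
    for name, grad_list in abs_grads.items():
        g = torch.stack(grad_list)               # (num_batches, *param_shape)
        mean, std = g.mean(dim=0), g.std(dim=0)  # element-wise statistics over batches
        nonzero = std > 0
        if nonzero.any():
            # Inverse coefficient of variation summed over this layer's weights,
            # then accumulated across layers in log space.
            score += torch.log((mean[nonzero] / std[nonzero]).sum()).item()
    return score
```

In a zero-shot search such as the one reported above, candidate architectures would be ranked by this score at initialization and only the highest-scoring network under the FLOPs budget would be trained; the search loop and budget handling are omitted from this sketch.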
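The hardware and experiment-setup rows report a concrete ImageNet training recipe: SGD with momentum 0.9 and weight decay 4e-5, initial learning rate 0.1 with cosine annealing, 480 epochs, batch size 512, and input resolution 224. Below is a minimal sketch of that optimizer/scheduler configuration in PyTorch; `model`, `train_set`, and the helper name `build_training_setup` are placeholders, and the distillation loss with an EfficientNet-B3 teacher mentioned above is not reproduced here.

```python
import torch
from torch.utils.data import DataLoader

def build_training_setup(model, train_set, epochs=480, batch_size=512):
    """Assemble the reported optimizer and scheduler; the training loop itself is omitted."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.1, momentum=0.9, weight_decay=4e-5
    )
    # Cosine annealing drives the learning rate from 0.1 toward 0 over the 480 epochs
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return loader, optimizer, scheduler
```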