Robustifying and Boosting Training-Free Neural Architecture Search
Authors: Zhenfeng He, Yao Shu, Zhongxiang Dai, Bryan Kian Hsiang Low
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on various NAS benchmark tasks yield substantial empirical evidence to support our theoretical results. Our code has been made publicly available at https://github.com/hzf1174/RoBoT. |
| Researcher Affiliation | Collaboration | Zhenfeng He1, Yao Shu2, Zhongxiang Dai3, Bryan Kian Hsiang Low1 1Department of Computer Science, National University of Singapore 2Guangdong Lab of AI and Digital Economy (SZ) 3Laboratory for Information and Decision Systems, Massachusetts Institute of Technology he.zhenfeng@u.nus.edu, shuyao@gml.ac.cn, daizx@mit.edu, lowkh@comp.nus.edu.sg |
| Pseudocode | Yes | Algorithm 1: Optimization of Weight Vector through Bayesian Optimization ... Algorithm 2: Robustifying and Boosting Training-Free Neural Architecture Search (RoBoT) |
| Open Source Code | Yes | Our code has been made publicly available at https://github.com/hzf1174/RoBoT. |
| Open Datasets | Yes | Our extensive experiments on various NAS benchmark tasks yield substantial empirical evidence to support our theoretical results. Our code has been made publicly available at https://github.com/hzf1174/RoBoT. |
| Dataset Splits | Yes | For NAS-Bench-201, we use the CIFAR-10 validation performance after 12 training epochs (i.e., hp=12) from the tabular data in NAS-Bench-201 as the objective evaluation metric f for all three datasets and compute the search cost displayed in Table 2 in the same manner (which is the training cost of 20 architectures). ... As for TransNAS-Bench-101, we note that for tasks Segment., Normal, and Autoenco. on both micro and macro datasets, the training-free metric synflow is inapplicable due to a tanh activation at the architecture's end, so we only use the remaining five training-free metrics. Moreover, given the considerable gap between validation and test performances in TransNAS-Bench-101, we only report our proposed architecture's validation performance. |
| Hardware Specification | Yes | The result of RoBoT is reported with the mean ± standard deviation of 10 runs, and search costs are evaluated on an Nvidia 1080Ti. |
| Software Dependencies | No | The paper mentions using Adam and SGD optimizers, but does not provide specific version numbers for any software libraries or dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Given a set of observations [f(A(w_1)), ..., f(A(w_t))], we assume that they are randomly drawn from a prior probability distribution, in this case, a GP. The GP is defined by a mean function and a covariance (or kernel) function. We set the mean function to be a constant, such as 0, and choose the Matérn kernel for the kernel function... The next weight vector is chosen as w_{t+1} = arg max_w µ(w) + κσ(w), where κ is the exploration-exploitation trade-off constant that regulates the balance between exploring the weight vector space and exploiting the current regression results. ... These architectures have 36 initial channels and an auxiliary tower with a weight of 0.4 for CIFAR-10 and 0.6 for CIFAR-100, located at the 13th layer. We test these architectures on CIFAR-10/100 by employing stochastic gradient descent (SGD) over 600 epochs. The learning rate started at 0.025 and gradually reduced to 0 for CIFAR-10, and from 0.035 to 0.001 for CIFAR-100, using a cosine schedule. The momentum was set at 0.9 and the weight decay was 3 × 10^-4 with a batch size of 96. Additionally, we use Cutout (DeVries & Taylor, 2017) and Scheduled DropPath, which linearly increased from 0 to 0.2 for CIFAR-10 (and from 0 to 0.3 for CIFAR-100), as regularization techniques for CIFAR-10/100. For the ImageNet evaluation, we train a 14-layer architecture from scratch over 250 epochs, with a batch size of 1024. The learning rate was initially increased to 0.7 over the first 5 epochs and then gradually decreased to zero following a cosine schedule. The SGD optimizer was used with a momentum of 0.9 and a weight decay of 3 × 10^-5. (Hedged code sketches of the Bayesian optimization step and the CIFAR-10 training schedule appear below the table.) |
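
To make the Bayesian optimization step quoted in the Experiment Setup row concrete, here is a minimal sketch of a Matérn-kernel GP surrogate with the UCB acquisition w_{t+1} = arg max_w µ(w) + κσ(w). The weight-vector range, the candidate-sampling strategy, and the `evaluate_weighted_metric` helper (which would rank architectures by the w-weighted combination of training-free metrics and return the objective f of the top-ranked one) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of GP-UCB search over the weight vector w (assumptions noted above).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def evaluate_weighted_metric(w: np.ndarray) -> float:
    """Hypothetical helper: rank architectures by the w-weighted sum of
    training-free metrics and return the objective f of the top-ranked one."""
    raise NotImplementedError


def ucb_search(n_metrics: int, n_iters: int = 20, kappa: float = 2.0,
               n_candidates: int = 1000, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Seed the GP with a few random weight vectors (range is an assumption).
    W = rng.uniform(-1.0, 1.0, size=(3, n_metrics))
    y = np.array([evaluate_weighted_metric(w) for w in W])

    # Zero-mean GP prior with a Matérn kernel, as described in the paper.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5))
    for _ in range(n_iters - 3):
        gp.fit(W, y)
        # Maximize the UCB acquisition mu + kappa * sigma over random candidates.
        cand = rng.uniform(-1.0, 1.0, size=(n_candidates, n_metrics))
        mu, sigma = gp.predict(cand, return_std=True)
        w_next = cand[np.argmax(mu + kappa * sigma)]
        W = np.vstack([W, w_next])
        y = np.append(y, evaluate_weighted_metric(w_next))
    # Return the best weight vector found within the query budget.
    return W[np.argmax(y)], y.max()
```

The budget of 20 evaluations mirrors the "training cost of 20 architectures" quoted in the Dataset Splits row, though the exact query schedule used by the authors may differ.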
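Likewise, a minimal PyTorch sketch of the reported CIFAR-10 retraining schedule (SGD over 600 epochs, learning rate 0.025 annealed to 0 with a cosine schedule, momentum 0.9, weight decay 3 × 10^-4, batch size 96). The `model` and `train_loader` objects are assumed to exist, and the paper's Cutout, Scheduled DropPath, and auxiliary-tower loss are omitted for brevity.

```python
# Sketch of the CIFAR-10 retraining loop under the hyperparameters quoted above.
import torch


def train_cifar10(model, train_loader, device="cuda", epochs=600):
    model = model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    # SGD with momentum 0.9 and weight decay 3e-4, as reported in the paper.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                                momentum=0.9, weight_decay=3e-4)
    # Cosine schedule annealing the learning rate from 0.025 to 0 over 600 epochs.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=0.0)
    for _ in range(epochs):
        model.train()
        for images, targets in train_loader:  # loader assumed to use batch size 96
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```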