XNAS: Neural Architecture Search with Expert Advice

Authors: Niv Nayman, Asaf Noy, Tal Ridnik, Itamar Friedman, Rong Jin, Lihi Zelnik-Manor

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our algorithm achieves a strong performance over several image classification datasets. Specifically, it obtains an error rate of 1.6% for CIFAR-10, 23.9% for ImageNet under mobile settings, and achieves state-of-the-art results on three additional datasets.
Researcher Affiliation | Industry | Niv Nayman, Asaf Noy, Tal Ridnik, Itamar Friedman, Rong Jin, Lihi Zelnik-Manor; Machine Intelligence Technology, Alibaba Group; {niv.nayman,asaf.noy,tal.ridnik,itamar.friedman,jinrong.jr,lihi.zelnik}@alibaba-inc.com
Pseudocode | Yes | Algorithm 1, "XNAS for a single forecaster" (an illustrative code sketch of this update follows the table).
Open Source Code | Yes | XNAS evaluation results can be reproduced using the code: https://github.com/NivNayman/XNAS
Open Datasets | Yes | We used the CIFAR-10 dataset for the main search and evaluation phase. In addition, using the cell found on CIFAR-10 we did transferability experiments on the well-known benchmarks ImageNet, CIFAR-100, SVHN, Fashion-MNIST, Freiburg and CINIC-10.
Dataset Splits | Yes | The train set is divided into two parts of equal sizes: one is used for training the operation weights ω and the other for training the architecture weights v, both with respect to the cross-entropy loss. With a batch size of 96, one epoch takes 8.5 minutes on average on a single GPU, summing up to 7 hours in total for a single search. For example, for CIFAR-10 with a 50%:50% train-validation split, 50 search epochs... (a PyTorch sketch of this split follows the table).
Hardware Specification | Yes | Experiments were performed using an NVIDIA GTX 1080Ti GPU.
Software Dependencies | No | The paper mentions optimizers like SGD with Nesterov momentum and Adam, but does not provide specific version numbers for software libraries or environments (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | The search phase lasts up to 50 epochs. We use the first-order approximation [25], relating to v and ω as independent parameters which can be optimized separately. The train set is divided into two parts of equal sizes: one is used for training the operation weights ω and the other for training the architecture weights v, both with respect to the cross-entropy loss. With a batch size of 96, one epoch takes 8.5 minutes on average on a single GPU, summing up to 7 hours in total for a single search. We trained the network for 1500 epochs using a batch size of 96 and an SGD optimizer with Nesterov momentum. Our learning rate regime was composed of 5 cycles of power cosine annealing learning rate [17], with an amplitude decay factor of 0.5 per cycle (a sketch of this schedule follows the table). For regularization we used cutout [9], scheduled drop-path [22], auxiliary towers [39], label smoothing [40], AutoAugment [7] and weight decay.
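The Pseudocode row refers to Algorithm 1, "XNAS for a single forecaster", which updates the architecture weights of one edge with an exponentiated-gradient (expert-advice) step and permanently wipes out weak operations. The sketch below is a minimal illustration of that idea, not the authors' implementation: the function name xnas_forecaster_step, the learning rate eta, and the fixed wipeout threshold are assumptions (the paper derives its wipeout criterion from a regret bound rather than a constant).

```python
import numpy as np

def xnas_forecaster_step(weights, grad, alive, eta=1.0, threshold=0.05):
    """One illustrative exponentiated-gradient step for a single forecaster.

    weights   : non-negative architecture weights of the candidate operations (experts)
    grad      : gradient of the validation loss w.r.t. the mixture weights
    alive     : boolean mask of experts that have not been wiped out yet
    eta       : exponentiated-gradient learning rate (assumed hyperparameter)
    threshold : assumed fixed wipeout level standing in for the paper's regret-based rule
    """
    w = weights * alive
    # Multiplicative (Hedge-style) update: experts whose gradient is negative gain mass.
    w = w * np.exp(-eta * grad)
    # Wipeout: once an expert drops below the threshold it never returns.
    alive = alive & (w >= threshold)
    w = w * alive
    # Renormalize the surviving experts to a probability vector.
    w = w / max(w.sum(), 1e-12)
    return w, alive

# Example: three candidate operations on one edge.
w = np.ones(3) / 3
alive = np.ones(3, dtype=bool)
w, alive = xnas_forecaster_step(w, grad=np.array([0.2, -0.1, 1.5]), alive=alive)
```

Unlike softmax-based relaxations, a wiped-out operation never re-enters the search; only the surviving experts keep competing.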
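For the Dataset Splits row, one common way to realize the 50%:50% split of the CIFAR-10 train set, with one half feeding the operation weights ω and the other the architecture weights v, is sketched below in PyTorch. This is an assumed setup for illustration, not the loader from the released repository.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler
from torchvision import datasets, transforms

# CIFAR-10 train set; one half trains the operation weights, the other half
# trains the architecture weights, as described in the Dataset Splits row.
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())

n = len(train_set)                       # 50,000 images
indices = torch.randperm(n).tolist()
split = n // 2                           # 50%:50% split

weights_loader = DataLoader(train_set, batch_size=96,
                            sampler=SubsetRandomSampler(indices[:split]))
arch_loader = DataLoader(train_set, batch_size=96,
                         sampler=SubsetRandomSampler(indices[split:]))
```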
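For the Experiment Setup row, the evaluation-phase learning-rate regime (5 cycles of power cosine annealing over 1500 epochs, amplitude decayed by 0.5 per cycle) could look roughly like the sketch below. The base learning rate and the power p are assumed values; the exact power-cosine form follows [17], while the cycle count, epoch budget, and 0.5 amplitude decay come from the quoted setup.

```python
import math

def power_cosine_cyclic_lr(epoch, total_epochs=1500, cycles=5,
                           base_lr=0.025, amplitude_decay=0.5, p=2.0):
    """Cyclic power cosine annealing: 5 cycles, amplitude halved each cycle.

    base_lr and the power p are illustrative assumptions; the cycle structure
    and the 0.5 amplitude decay follow the Experiment Setup row.
    """
    cycle_len = total_epochs // cycles
    cycle = min(epoch // cycle_len, cycles - 1)   # index of the current cycle
    t = (epoch % cycle_len) / cycle_len           # progress within the cycle, in [0, 1)
    amplitude = base_lr * (amplitude_decay ** cycle)
    # A cosine decay raised to the power p spends more of each cycle
    # at low learning rates than a plain cosine schedule.
    return amplitude * ((1.0 + math.cos(math.pi * t)) / 2.0) ** p

# Example: learning rate at the start of each of the 5 cycles.
cycle_starts = [power_cosine_cyclic_lr(e) for e in range(0, 1500, 300)]
```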