SI-VDNAS: Semi-Implicit Variational Dropout for Hierarchical One-shot Neural Architecture Search

Authors: Yaoming Wang, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that SI-VDNAS finds a convergent architecture with only 2.7 MB parameters within 0.8 GPU-days and can achieve 2.60% top-1 error rate on CIFAR-10. (Section 4, Experiments)
Researcher Affiliation | Academia | 1) Department of Electronic Engineering, Shanghai Jiao Tong University, China; 2) Department of Computer Science & Engineering, Shanghai Jiao Tong University, China. {wang_yaoming, daiwenrui, lcl1985, zoujunni, xionghongkai}@sjtu.edu.cn
Pseudocode | Yes | Algorithm 1: Semi-implicit Variational Dropout NAS
Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | CIFAR-10/100 [Krizhevsky and Hinton, 2009] is a popular dataset consisting of 60K images, 50K training images and 10K test images. ... ImageNet [Deng et al., 2009] is a large-scale benchmark for image classification. (Section 4.1, Datasets)
Dataset Splits | Yes | For the training, we split the training images into two subsets with the same size. One subset is used for training network parameters, the other is used for architectural parameters. CIFAR-10/100 [Krizhevsky and Hinton, 2009] is a popular dataset consisting of 60K images, 50K training images and 10K test images. (A minimal sketch of this split is given after the table.)
Hardware Specification | Yes | The search process requires 8 GPU-hours for an optimal structure within 50 epochs and 20 GPU-hours for a convergent result within 150 epochs on a single NVIDIA GTX 1080Ti GPU. The search time can be reduced by about 50% on a single Tesla V100 GPU.
Software Dependencies | No | The paper mentions optimizers (e.g., SGD) and techniques (e.g., Cutout, the drop-path trick) but does not specify any software libraries or frameworks with version numbers (e.g., TensorFlow, PyTorch, scikit-learn) required to replicate the experiments.
Experiment Setup | Yes | The initial number of channels is set to 36. The network weights are trained from scratch using all the 50K training images with a batch size of 96. The network is trained for 600 epochs. We use the SGD optimizer with an initial learning rate of 0.025 (annealed down to zero following a cosine schedule without restart), a momentum of 0.9, a weight decay of 3×10⁻⁴/5×10⁻⁴, and gradient norm clipping at 5. We apply the drop-path trick with the probability of 0.3. Cutout is also used in our evaluation. (See the training-setup sketch below.)
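
The 50/50 split quoted under Dataset Splits can be expressed with index-based samplers. This is only a minimal sketch assuming PyTorch and torchvision, which the paper does not name; the batch size and data path are illustrative choices, not values taken from the paper.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler
from torchvision import datasets, transforms

# CIFAR-10 training set (50K images), split into two equal halves as described:
# one half updates network weights, the other updates architecture parameters.
train_data = datasets.CIFAR10("./data", train=True, download=True,
                              transform=transforms.ToTensor())
indices = torch.randperm(len(train_data)).tolist()
split = len(train_data) // 2  # 25,000 / 25,000

weight_loader = DataLoader(train_data, batch_size=64,
                           sampler=SubsetRandomSampler(indices[:split]))
arch_loader = DataLoader(train_data, batch_size=64,
                         sampler=SubsetRandomSampler(indices[split:]))
```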
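
The retraining recipe quoted under Experiment Setup maps onto a standard training loop. The sketch below assumes PyTorch and torchvision (again, the paper does not name a framework): the stand-in model, the Cutout patch length of 16, and the data path are assumptions, while the batch size of 96, 600 epochs, SGD with learning rate 0.025 annealed by a cosine schedule, momentum 0.9, weight decay 3×10⁻⁴, and gradient norm clipping at 5 follow the quoted setup. Drop-path with probability 0.3 is a property of the searched network and is omitted here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class Cutout:
    """Zero out one random square patch per image (the Cutout augmentation)."""
    def __init__(self, length=16):
        self.length = length

    def __call__(self, img):  # img: Tensor of shape [C, H, W]
        _, h, w = img.shape
        y, x = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
        y1, y2 = max(0, y - self.length // 2), min(h, y + self.length // 2)
        x1, x2 = max(0, x - self.length // 2), min(w, x + self.length // 2)
        img[:, y1:y2, x1:x2] = 0.0
        return img

# Retraining uses all 50K CIFAR-10 training images with a batch size of 96.
train_loader = DataLoader(
    datasets.CIFAR10("./data", train=True, download=True,
                     transform=transforms.Compose([transforms.ToTensor(),
                                                   Cutout(length=16)])),
    batch_size=96, shuffle=True)

# Stand-in model: the paper's searched cell-based network is not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                            momentum=0.9, weight_decay=3e-4)
# Cosine schedule annealing the learning rate to zero over 600 epochs, no restarts.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600)
criterion = nn.CrossEntropyLoss()

for epoch in range(600):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 5.0)  # gradient norm clipping at 5
        optimizer.step()
    scheduler.step()
```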