SI-VDNAS: Semi-Implicit Variational Dropout for Hierarchical One-shot Neural Architecture Search
Authors: Yaoming Wang, Wenrui Dai, Chenglin Li, Junni Zou, Hongkai Xiong
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that SI-VDNAS finds a convergent architecture with only 2.7 MB parameters within 0.8 GPU-days and can achieve 2.60% top-1 error rate on CIFAR-10. |
| Researcher Affiliation | Academia | 1Department of Electronic Engineering, Shanghai Jiao Tong University, China 2Department of Computer Science & Engineering, Shanghai Jiao Tong University, China {wang yaoming, daiwenrui, lcl1985, zoujunni, xionghongkai}@sjtu.edu.cn |
| Pseudocode | Yes | Algorithm 1 Semi-implicit Variational Dropout NAS |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | 4.1 Datasets CIFAR-10/100 [Krizhevsky and Hinton, 2009] is a popular dataset consisting of 60K images, 50K training images and 10K test images. ... ImageNet [Deng et al., 2009] is a large-scale benchmark for image classification. |
| Dataset Splits | Yes | For the training, we split the training images into two subsets with the same size. One subset is used for training network parameters, the other is used for architectural parameters. CIFAR-10/100 [Krizhevsky and Hinton, 2009] is a popular dataset consisting of 60K images, 50K training images and 10K test images. (See the split sketch below the table.) |
| Hardware Specification | Yes | The search process requires 8 GPU-hours for optimal structure within 50 epochs and 20 GPU-hours for a convergent result within 150 epochs on a single NVIDIA GTX 1080Ti GPU. The search time can be reduced by about 50% on a single Tesla V100 GPU. |
| Software Dependencies | No | The paper mentions optimizers (e.g., SGD) and techniques (e.g., Cutout, drop-path trick) but does not specify any software libraries or frameworks with version numbers (e.g., TensorFlow, PyTorch, scikit-learn versions) required to replicate the experiments. |
| Experiment Setup | Yes | The initial number of channels is set to 36. The network weights are trained from scratch using all the 50K training images with a batch size of 96. The network is trained for 600 epochs. We use the SGD optimizer with an initial learning rate of 0.025 (annealed down to zero following a cosine schedule without restart), a momentum of 0.9, a weight decay of 3×10⁻⁴/5×10⁻⁴ and gradient norm clipping at 5. We apply the drop-path trick with the probability of 0.3. Cutout is also used in our evaluation. (See the training sketch below the table.) |
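The Dataset Splits row quotes the standard one-shot NAS protocol: the 50K CIFAR-10 training images are halved, with one half updating the network weights and the other half updating the architecture parameters. Below is a minimal PyTorch sketch of that split; the torchvision transforms and the batch size are illustrative assumptions, not the authors' code (which is not released).

```python
# Minimal sketch of the 50/50 search-phase split quoted in the Dataset Splits row.
# The transform and batch size are placeholders; only the equal split into a
# weight-update half and an architecture-update half comes from the paper.
import torch
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())

indices = torch.randperm(len(train_set)).tolist()   # 50,000 training images
split = len(indices) // 2

# One half of the training set updates the network (operation) weights ...
weight_loader = torch.utils.data.DataLoader(
    train_set, batch_size=64,
    sampler=torch.utils.data.SubsetRandomSampler(indices[:split]))
# ... the other half updates the architecture parameters.
arch_loader = torch.utils.data.DataLoader(
    train_set, batch_size=64,
    sampler=torch.utils.data.SubsetRandomSampler(indices[split:]))
```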
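The Experiment Setup row lists the evaluation-phase hyperparameters: SGD with learning rate 0.025 annealed to zero by a cosine schedule over 600 epochs, momentum 0.9, weight decay 3×10⁻⁴ (CIFAR-10) or 5×10⁻⁴ (CIFAR-100), gradient norm clipping at 5, batch size 96, 36 initial channels, drop-path probability 0.3, and Cutout. The sketch below wires those settings together in PyTorch; the model is a stand-in for the derived cell-based network, and the drop-path/Cutout hooks are only indicated in comments.

```python
# Hedged sketch of the evaluation-phase training schedule quoted in the
# Experiment Setup row. The model is a placeholder for the derived cell-based
# network (36 initial channels); drop-path (prob 0.3) and Cutout belong in the
# real network and data pipeline, respectively.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=96, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # placeholder network
optimizer = torch.optim.SGD(model.parameters(), lr=0.025,
                            momentum=0.9, weight_decay=3e-4)      # 5e-4 for CIFAR-100
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=600, eta_min=0.0)                            # cosine, no restart
criterion = nn.CrossEntropyLoss()

for epoch in range(600):                                          # 600 training epochs
    # drop-path (prob 0.3) and Cutout would be applied here in the real pipeline
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 5.0)         # grad norm clipping at 5
        optimizer.step()
    scheduler.step()
```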