Efficient Architecture Search by Network Transformation
Authors: Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, Jun Wang
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our method to explore the architecture space of the plain convolutional neural networks (no skip-connections, branching etc.) on image benchmark datasets (CIFAR-10, SVHN) with restricted computational resources (5 GPUs). |
| Researcher Affiliation | Academia | ¹Shanghai Jiao Tong University, ²University College London; {hcai,tychen,wnzhang,yyu}@apex.sjtu.edu.cn, j.wang@cs.ucl.ac.uk |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Experiment code and discovered top architectures along with weights: https://github.com/han-cai/EAS |
| Open Datasets | Yes | The CIFAR-10 dataset (Krizhevsky and Hinton 2009) consists of 50,000 training images and 10,000 test images. The Street View House Numbers (SVHN) dataset (Netzer et al. 2011) contains 73,257 images in the original training set, 26,032 images in the test set, and 531,131 additional images in the extra training set. |
| Dataset Splits | Yes | For CIFAR-10: Following the previous work (Baker et al. 2017; Zoph and Le 2017), we randomly sample 5,000 images from the training set to form a validation set while using the remaining 45,000 images for training when exploring the architecture space. For SVHN: We follow (Baker et al. 2017) and use the original training set during the architecture search phase with 5,000 randomly sampled images as the validation set, while training the final discovered architectures using all the training data, including the original training set and the extra training set. (A minimal loading/split sketch follows the table.) |
| Hardware Specification | Yes | Specifically, it takes less than 2 days on 5 GeForce GTX 1080 GPUs, with a total of 450 networks trained, to achieve a 4.89% test error rate on C10+ starting from a small network. |
| Software Dependencies | No | The paper mentions the ADAM optimizer and SGD with Nesterov momentum, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For the meta-controller, we use a one-layer bidirectional LSTM with 50 hidden units as the encoder network (Figure 1) with an embedding size of 16, and train it with the ADAM optimizer (Kingma and Ba 2015). At each step, the meta-controller samples 10 networks by taking network transformation actions. Since the sampled networks are not trained from scratch but reuse the weights of the given network in our scenario, they are then trained for 20 epochs, a relatively small number compared to the 50 epochs in (Zoph and Le 2017). For the same reason, we also use a smaller initial learning rate. Other settings for training networks on CIFAR-10 and SVHN are similar to (Huang et al. 2017; Zoph and Le 2017). Specifically, we use SGD with a Nesterov momentum (Sutskever et al. 2013) of 0.9, a weight decay of 0.0001, and a batch size of 64. The initial learning rate is 0.02 and is further annealed with a cosine learning rate decay (Gastaldi 2017). For every convolutional layer, the filter size is chosen from {1, 3, 5} and the number of filters is chosen from {16, 32, 64, 96, 128, 192, 256, 320, 384, 448, 512}, while the stride is fixed to be 1 (Baker et al. 2017). For every fully-connected layer, the number of units is chosen from {64, 128, 256, 384, 512, 640, 768, 896, 1024}. Additionally, we use ReLU and batch normalization for each convolutional or fully-connected layer. For SVHN, we add a dropout layer after each convolutional layer (except the first layer) and use a dropout rate of 0.2 (Huang et al. 2017). (Configuration sketches follow the table.) |
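
The dataset handling quoted in the Open Datasets and Dataset Splits rows can be illustrated roughly as follows. This is a minimal torchvision-based sketch under stated assumptions (a fixed seed and ToTensor-only preprocessing), not the authors' released code, which lives in the repository linked above.

```python
# Minimal sketch of the splits described above (torchvision assumed; not the authors' code).
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # preprocessing simplified; the paper's augmentation is omitted

# CIFAR-10: 45,000 images for training and 5,000 for validation during the search phase.
cifar_train_full = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
cifar_test = datasets.CIFAR10("./data", train=False, download=True, transform=transform)
generator = torch.Generator().manual_seed(0)  # seed is an assumption; the paper samples randomly
cifar_train, cifar_val = random_split(cifar_train_full, [45_000, 5_000], generator=generator)

# SVHN: the original training set (73,257 images) is used during search with 5,000 randomly
# sampled validation images; the extra set (531,131 images) is added only when training the
# final discovered architectures.
svhn_train_full = datasets.SVHN("./data", split="train", download=True, transform=transform)
svhn_extra = datasets.SVHN("./data", split="extra", download=True, transform=transform)
svhn_test = datasets.SVHN("./data", split="test", download=True, transform=transform)
svhn_train, svhn_val = random_split(
    svhn_train_full, [len(svhn_train_full) - 5_000, 5_000], generator=generator)
```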
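The meta-controller encoder named in the Experiment Setup row (a one-layer bidirectional LSTM with 50 hidden units over layer embeddings of size 16, trained with ADAM) might be wired up roughly as below. The vocabulary size and dummy input are hypothetical, and the actor networks that act on the encoder states are omitted; this is not the authors' implementation.

```python
# Rough sketch of the encoder configuration only (hypothetical vocabulary and inputs).
import torch
from torch import nn, optim

VOCAB_SIZE = 128  # hypothetical number of layer tokens; not specified in the excerpt above
embedding = nn.Embedding(VOCAB_SIZE, 16)                  # embedding size 16
encoder = nn.LSTM(input_size=16, hidden_size=50,          # one-layer bi-LSTM with 50 hidden units
                  num_layers=1, batch_first=True, bidirectional=True)
meta_optimizer = optim.Adam(list(embedding.parameters()) + list(encoder.parameters()))

layer_tokens = torch.randint(0, VOCAB_SIZE, (10, 8))      # 10 sampled networks, 8 layers each
states, _ = encoder(embedding(layer_tokens))              # (10, 8, 100) bidirectional hidden states
```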
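Likewise, the reported training hyper-parameters and layer search space can be written down as a configuration sketch. The placeholder model is an assumption, the PyTorch cosine schedule is only an approximation of the cited decay, and the original experiments were not run with this code.

```python
# Sketch of the reported training settings (assumed PyTorch equivalents; not the authors' code).
import torch
from torch import nn, optim

# Layer search space quoted above.
CONV_FILTER_SIZES = [1, 3, 5]                                      # stride fixed to 1
CONV_FILTER_COUNTS = [16, 32, 64, 96, 128, 192, 256, 320, 384, 448, 512]
FC_UNITS = [64, 128, 256, 384, 512, 640, 768, 896, 1024]

model = nn.Sequential(                                             # placeholder plain CNN
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

batch_size = 64
epochs = 20   # sampled networks reuse the parent network's weights and train for 20 epochs
optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9,
                      nesterov=True, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```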