Efficient Architecture Search by Network Transformation
Authors: Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, Jun Wang
AAAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply our method to explore the architecture space of the plain convolutional neural networks (no skip-connections, branching etc.) on image benchmark datasets (CIFAR-10, SVHN) with restricted computational resources (5 GPUs). |
| Researcher Affiliation | Academia | ¹Shanghai Jiao Tong University, ²University College London; {hcai,tychen,wnzhang,yyu}@apex.sjtu.edu.cn, j.wang@cs.ucl.ac.uk |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Experiment code and discovered top architectures along with weights: https://github.com/han-cai/EAS |
| Open Datasets | Yes | The CIFAR-10 dataset (Krizhevsky and Hinton 2009) consists of 50,000 training images and 10,000 test images. The Street View House Numbers (SVHN) dataset (Netzer et al. 2011) contains 73,257 images in the original training set, 26,032 images in the test set, and 531,131 additional images in the extra training set. |
| Dataset Splits | Yes | For CIFAR-10: Following the previous work (Baker et al. 2017; Zoph and Le 2017), we randomly sample 5,000 images from the training set to form a validation set while using the remaining 45,000 images for training when exploring the architecture space. For SVHN: We follow (Baker et al. 2017) and use the original training set during the architecture search phase with 5,000 randomly sampled images as the validation set, while training the final discovered architectures using all the training data, including the original training set and the extra training set. (A minimal loading/split sketch follows the table.) |
| Hardware Specification | Yes | Specifically, it takes less than 2 days on 5 GeForce GTX 1080 GPUs, with a total of 450 networks trained, to achieve a 4.89% test error rate on C10+ starting from a small network. |
| Software Dependencies | No | The paper mentions the ADAM optimizer and SGD with Nesterov momentum, but it does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | For the meta-controller, we use a one-layer bidirectional LSTM with 50 hidden units as the encoder network (Figure 1) with an embedding size of 16, and train it with the ADAM optimizer (Kingma and Ba 2015). At each step, the meta-controller samples 10 networks by taking network transformation actions. Since the sampled networks are not trained from scratch but reuse the weights of the given network in our scenario, they are then trained for 20 epochs, a relatively small number compared to the 50 epochs in (Zoph and Le 2017). For the same reason, we also use a smaller initial learning rate. Other settings for training networks on CIFAR-10 and SVHN are similar to (Huang et al. 2017; Zoph and Le 2017). Specifically, we use SGD with a Nesterov momentum (Sutskever et al. 2013) of 0.9, a weight decay of 0.0001, and a batch size of 64. The initial learning rate is 0.02 and is further annealed with a cosine learning rate decay (Gastaldi 2017). For every convolutional layer, the filter size is chosen from {1, 3, 5} and the number of filters is chosen from {16, 32, 64, 96, 128, 192, 256, 320, 384, 448, 512}, while the stride is fixed to be 1 (Baker et al. 2017). For every fully-connected layer, the number of units is chosen from {64, 128, 256, 384, 512, 640, 768, 896, 1024}. Additionally, we use ReLU and batch normalization for each convolutional or fully-connected layer. For SVHN, we add a dropout layer after each convolutional layer (except the first layer) and use a dropout rate of 0.2 (Huang et al. 2017). (Configuration sketches follow the table.) |
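
The dataset handling quoted in the Open Datasets and Dataset Splits rows can be illustrated roughly as follows. This is a minimal torchvision-based sketch under stated assumptions (a fixed seed and ToTensor-only preprocessing), not the authors' released code, which lives in the repository linked above.

```python
# Minimal sketch of the splits described above (torchvision assumed; not the authors' code).
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # preprocessing simplified; the paper's augmentation is omitted

# CIFAR-10: 45,000 images for training and 5,000 for validation during the search phase.
cifar_train_full = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
cifar_test = datasets.CIFAR10("./data", train=False, download=True, transform=transform)
generator = torch.Generator().manual_seed(0)  # seed is an assumption; the paper samples randomly
cifar_train, cifar_val = random_split(cifar_train_full, [45_000, 5_000], generator=generator)

# SVHN: the original training set (73,257 images) is used during search with 5,000 randomly
# sampled validation images; the extra set (531,131 images) is added only when training the
# final discovered architectures.
svhn_train_full = datasets.SVHN("./data", split="train", download=True, transform=transform)
svhn_extra = datasets.SVHN("./data", split="extra", download=True, transform=transform)
svhn_test = datasets.SVHN("./data", split="test", download=True, transform=transform)
svhn_train, svhn_val = random_split(
    svhn_train_full, [len(svhn_train_full) - 5_000, 5_000], generator=generator)
```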
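The meta-controller encoder named in the Experiment Setup row (a one-layer bidirectional LSTM with 50 hidden units over layer embeddings of size 16, trained with ADAM) might be wired up roughly as below. The vocabulary size and dummy input are hypothetical, and the actor networks that act on the encoder states are omitted; this is not the authors' implementation.

```python
# Rough sketch of the encoder configuration only (hypothetical vocabulary and inputs).
import torch
from torch import nn, optim

VOCAB_SIZE = 128  # hypothetical number of layer tokens; not specified in the excerpt above
embedding = nn.Embedding(VOCAB_SIZE, 16)                  # embedding size 16
encoder = nn.LSTM(input_size=16, hidden_size=50,          # one-layer bi-LSTM with 50 hidden units
                  num_layers=1, batch_first=True, bidirectional=True)
meta_optimizer = optim.Adam(list(embedding.parameters()) + list(encoder.parameters()))

layer_tokens = torch.randint(0, VOCAB_SIZE, (10, 8))      # 10 sampled networks, 8 layers each
states, _ = encoder(embedding(layer_tokens))              # (10, 8, 100) bidirectional hidden states
```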
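Likewise, the reported training hyper-parameters and layer search space can be written down as a configuration sketch. The placeholder model is an assumption, the PyTorch cosine schedule is only an approximation of the cited decay, and the original experiments were not run with this code.

```python
# Sketch of the reported training settings (assumed PyTorch equivalents; not the authors' code).
import torch
from torch import nn, optim

# Layer search space quoted above.
CONV_FILTER_SIZES = [1, 3, 5]                                      # stride fixed to 1
CONV_FILTER_COUNTS = [16, 32, 64, 96, 128, 192, 256, 320, 384, 448, 512]
FC_UNITS = [64, 128, 256, 384, 512, 640, 768, 896, 1024]

model = nn.Sequential(                                             # placeholder plain CNN
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))

batch_size = 64
epochs = 20   # sampled networks reuse the parent network's weights and train for 20 epochs
optimizer = optim.SGD(model.parameters(), lr=0.02, momentum=0.9,
                      nesterov=True, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```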