Efficient Neural Architecture Search via Parameters Sharing

Authors: Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, Jeff Dean

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design... On Penn Treebank, ENAS discovers a novel architecture that achieves a test perplexity of 56.3... On CIFAR-10, ENAS finds a novel architecture that achieves 2.89% test error... Importantly, in all of our experiments, for which we use a single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours.
Researcher Affiliation | Collaboration | 1 Google Brain; 2 Language Technology Institute, Carnegie Mellon University; 3 Department of Computer Science, Stanford University.
Pseudocode | No | The paper describes the ENAS mechanism through examples and prose, but does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statements about releasing code for the described methodology, nor does it provide any repository links.
Open Datasets | Yes | Penn Treebank (Marcus et al., 1994) is a well-studied benchmark for language model. We use the standard pre-processed version of the dataset, which is also used by previous works, e.g. Zaremba et al. (2014). The CIFAR-10 dataset (Krizhevsky, 2009) consists of 50,000 training images and 10,000 test images.
Dataset Splits | Yes | The reward R(m, ω) is computed on the validation set, rather than on the training set, to encourage ENAS to select models that generalize well rather than models that overfit the training set well. (A minimal sketch of this validation-set reward appears after the table.)
Hardware Specification | Yes | Importantly, in all of our experiments, for which we use a single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours. Running on a single Nvidia GTX 1080Ti GPU, ENAS finds a recurrent cell in about 10 hours.
Software Dependencies | No | Our controller is trained using Adam... The shared parameters of the child models ω are trained using SGD... We employ the Adam optimizer (Kingma & Ba, 2015)... The shared parameters ω are trained with Nesterov momentum (Nesterov, 1983)... We also apply an ℓ2 weight decay... variational dropout (Gal & Ghahramani, 2016); and tying word embeddings and softmax weights (Inan et al., 2017). (Optimizers and regularizers are named, but no software packages or versions; an optimizer-configuration sketch follows the table.)
Experiment Setup | Yes | Our controller is trained using Adam, with a learning rate of 0.00035. To prevent premature convergence, we also use a tanh constant of 2.5 and a temperature of 5.0 for the sampling logits... add the controller's sample entropy to the reward, weighted by 0.0001... The shared parameters of the child models ω are trained using SGD with a learning rate of 20.0, decayed by a factor of 0.96 after every epoch starting at epoch 15, for a total of 150 epochs. We clip the norm of the gradient ∇ω at 0.25... The shared parameters ω are trained with Nesterov momentum... learning rate follows the cosine schedule with lmax = 0.05, lmin = 0.001, T0 = 10, and Tmul = 2... Each architecture search is run for 310 epochs. We initialize ω with He initialization... We also apply an ℓ2 weight decay of 10^-4. The policy parameters θ are initialized uniformly in [-0.1, 0.1], and trained with Adam at a learning rate of 0.00035. (A sketch of the controller's logit shaping and entropy bonus also follows the table.)
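
The Dataset Splits row quotes that the reward R(m, ω) is computed on the validation set rather than the training set. The snippet below is a minimal PyTorch sketch of that idea for the CIFAR-10 case, assuming a hypothetical sampled child model and a pre-built validation loader; it illustrates the quoted setup and is not the authors' implementation.

```python
import torch

def validation_reward(child_model, valid_loader, device="cpu"):
    """Reward R(m, w) for a sampled child architecture m with shared
    weights w, computed on the *validation* set so the controller is
    steered toward architectures that generalize rather than overfit.
    `child_model` and `valid_loader` are hypothetical placeholders."""
    child_model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in valid_loader:
            images, labels = images.to(device), labels.to(device)
            predictions = child_model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total  # validation accuracy used as the reward signal
```

For the Penn Treebank search the reward is a function of validation perplexity rather than accuracy; the accuracy form above matches the image-classification case.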
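
The Software Dependencies row names optimizers but no software packages or versions. As one concrete reading of the quoted training settings, here is a hedged PyTorch sketch; the placeholder modules, the momentum coefficient of 0.9, and the use of CosineAnnealingWarmRestarts to realize the quoted cosine schedule are assumptions, not details stated in the paper.

```python
import torch

# Placeholder modules standing in for the controller policy (theta) and the
# shared child-model weights (omega) referred to in the quoted setup.
controller = torch.nn.LSTMCell(64, 64)
shared_model = torch.nn.Linear(256, 10)

# Controller: Adam at the quoted learning rate of 0.00035.
controller_opt = torch.optim.Adam(controller.parameters(), lr=3.5e-4)

# CIFAR-10 shared parameters: Nesterov-momentum SGD with an l2 weight decay
# of 1e-4 and a cosine schedule (l_max=0.05, l_min=0.001, T0=10, Tmul=2).
# The momentum value of 0.9 is an assumption; it is not quoted above.
shared_opt = torch.optim.SGD(shared_model.parameters(), lr=0.05,
                             momentum=0.9, nesterov=True, weight_decay=1e-4)
shared_sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    shared_opt, T_0=10, T_mult=2, eta_min=0.001)

# Penn Treebank shared parameters: plain SGD at lr 20.0, decayed by 0.96 per
# epoch starting at epoch 15, with the gradient norm clipped at 0.25.
ptb_opt = torch.optim.SGD(shared_model.parameters(), lr=20.0)
ptb_sched = torch.optim.lr_scheduler.LambdaLR(
    ptb_opt, lr_lambda=lambda epoch: 0.96 ** max(0, epoch - 14))
# In a real training loop this is called after loss.backward(); it is shown
# here only to record the quoted clipping threshold of 0.25.
torch.nn.utils.clip_grad_norm_(shared_model.parameters(), max_norm=0.25)
```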
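
The Experiment Setup row also quotes how the controller's sampling logits are shaped (temperature 5.0, tanh constant 2.5) and that the sample entropy is added to the reward with weight 0.0001. Below is a minimal sketch of one such REINFORCE-style decision; the single four-way choice, the placeholder reward value, and detaching the entropy bonus are illustrative assumptions.

```python
import torch

def sample_decision(raw_logits, temperature=5.0, tanh_constant=2.5):
    """Shape the controller's sampling logits with the quoted temperature
    and tanh constant, then sample one architectural decision."""
    logits = tanh_constant * torch.tanh(raw_logits / temperature)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action), dist.entropy()

# Hypothetical single decision among four candidate operations.
raw_logits = torch.randn(4, requires_grad=True)
action, log_prob, entropy = sample_decision(raw_logits)

reward = 0.75                               # placeholder validation reward R(m, w)
reward = reward + 1e-4 * entropy.detach()   # entropy added to the reward, weight 0.0001
loss = -log_prob * reward                   # REINFORCE loss for the policy parameters
loss.backward()                             # gradients flow back into the logits
```

In ENAS the controller emits a sequence of such decisions per sampled architecture and their log-probabilities are combined; the sketch covers a single decision for brevity.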