Efficient Neural Architecture Search via Parameters Sharing

Authors: Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, Jeff Dean

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design... On Penn Treebank, ENAS discovers a novel architecture that achieves a test perplexity of 56.3... On CIFAR-10, ENAS finds a novel architecture that achieves 2.89% test error... Importantly, in all of our experiments, for which we use a single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours.
Researcher Affiliation | Collaboration | 1 Google Brain; 2 Language Technology Institute, Carnegie Mellon University; 3 Department of Computer Science, Stanford University.
Pseudocode | No | The paper describes the ENAS mechanism through examples and prose, but does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any statements about releasing code for the described methodology, nor does it provide any repository links.
Open Datasets | Yes | Penn Treebank (Marcus et al., 1994) is a well-studied benchmark for language model. We use the standard pre-processed version of the dataset, which is also used by previous works, e.g. Zaremba et al. (2014). The CIFAR-10 dataset (Krizhevsky, 2009) consists of 50,000 training images and 10,000 test images.
Dataset Splits | Yes | The reward R(m, ω) is computed on the validation set, rather than on the training set, to encourage ENAS to select models that generalize well rather than models that overfit the training set well. (A minimal sketch of this validation-set reward appears after the table.)
Hardware Specification | Yes | Importantly, in all of our experiments, for which we use a single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours. Running on a single Nvidia GTX 1080Ti GPU, ENAS finds a recurrent cell in about 10 hours.
Software Dependencies | No | Our controller is trained using Adam... The shared parameters of the child models ω are trained using SGD... We employ the Adam optimizer (Kingma & Ba, 2015)... The shared parameters ω are trained with Nesterov momentum (Nesterov, 1983)... We also apply an ℓ2 weight decay... variational dropout (Gal & Ghahramani, 2016); and tying word embeddings and softmax weights (Inan et al., 2017). (Optimizers and regularizers are named, but no software packages or versions; an optimizer-configuration sketch follows the table.)
Experiment Setup | Yes | Our controller is trained using Adam, with a learning rate of 0.00035. To prevent premature convergence, we also use a tanh constant of 2.5 and a temperature of 5.0 for the sampling logits... add the controller's sample entropy to the reward, weighted by 0.0001... The shared parameters of the child models ω are trained using SGD with a learning rate of 20.0, decayed by a factor of 0.96 after every epoch starting at epoch 15, for a total of 150 epochs. We clip the norm of the gradient ∇ω at 0.25... The shared parameters ω are trained with Nesterov momentum... learning rate follows the cosine schedule with lmax = 0.05, lmin = 0.001, T0 = 10, and Tmul = 2... Each architecture search is run for 310 epochs. We initialize ω with He initialization... We also apply an ℓ2 weight decay of 10^-4. The policy parameters θ are initialized uniformly in [-0.1, 0.1], and trained with Adam at a learning rate of 0.00035. (A sketch of the controller's logit shaping and entropy bonus also follows the table.)
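
The Dataset Splits row quotes that the reward R(m, ω) is computed on the validation set rather than the training set. The snippet below is a minimal PyTorch sketch of that idea for the CIFAR-10 case, assuming a hypothetical sampled child model and a pre-built validation loader; it illustrates the quoted setup and is not the authors' implementation.

```python
import torch

def validation_reward(child_model, valid_loader, device="cpu"):
    """Reward R(m, w) for a sampled child architecture m with shared
    weights w, computed on the *validation* set so the controller is
    steered toward architectures that generalize rather than overfit.
    `child_model` and `valid_loader` are hypothetical placeholders."""
    child_model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in valid_loader:
            images, labels = images.to(device), labels.to(device)
            predictions = child_model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total  # validation accuracy used as the reward signal
```

For the Penn Treebank search the reward is a function of validation perplexity rather than accuracy; the accuracy form above matches the image-classification case.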
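
The Software Dependencies row names optimizers but no software packages or versions. As one concrete reading of the quoted training settings, here is a hedged PyTorch sketch; the placeholder modules, the momentum coefficient of 0.9, and the use of CosineAnnealingWarmRestarts to realize the quoted cosine schedule are assumptions, not details stated in the paper.

```python
import torch

# Placeholder modules standing in for the controller policy (theta) and the
# shared child-model weights (omega) referred to in the quoted setup.
controller = torch.nn.LSTMCell(64, 64)
shared_model = torch.nn.Linear(256, 10)

# Controller: Adam at the quoted learning rate of 0.00035.
controller_opt = torch.optim.Adam(controller.parameters(), lr=3.5e-4)

# CIFAR-10 shared parameters: Nesterov-momentum SGD with an l2 weight decay
# of 1e-4 and a cosine schedule (l_max=0.05, l_min=0.001, T0=10, Tmul=2).
# The momentum value of 0.9 is an assumption; it is not quoted above.
shared_opt = torch.optim.SGD(shared_model.parameters(), lr=0.05,
                             momentum=0.9, nesterov=True, weight_decay=1e-4)
shared_sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    shared_opt, T_0=10, T_mult=2, eta_min=0.001)

# Penn Treebank shared parameters: plain SGD at lr 20.0, decayed by 0.96 per
# epoch starting at epoch 15, with the gradient norm clipped at 0.25.
ptb_opt = torch.optim.SGD(shared_model.parameters(), lr=20.0)
ptb_sched = torch.optim.lr_scheduler.LambdaLR(
    ptb_opt, lr_lambda=lambda epoch: 0.96 ** max(0, epoch - 14))
# In a real training loop this is called after loss.backward(); it is shown
# here only to record the quoted clipping threshold of 0.25.
torch.nn.utils.clip_grad_norm_(shared_model.parameters(), max_norm=0.25)
```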
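
The Experiment Setup row also quotes how the controller's sampling logits are shaped (temperature 5.0, tanh constant 2.5) and that the sample entropy is added to the reward with weight 0.0001. Below is a minimal sketch of one such REINFORCE-style decision; the single four-way choice, the placeholder reward value, and detaching the entropy bonus are illustrative assumptions.

```python
import torch

def sample_decision(raw_logits, temperature=5.0, tanh_constant=2.5):
    """Shape the controller's sampling logits with the quoted temperature
    and tanh constant, then sample one architectural decision."""
    logits = tanh_constant * torch.tanh(raw_logits / temperature)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    return action, dist.log_prob(action), dist.entropy()

# Hypothetical single decision among four candidate operations.
raw_logits = torch.randn(4, requires_grad=True)
action, log_prob, entropy = sample_decision(raw_logits)

reward = 0.75                               # placeholder validation reward R(m, w)
reward = reward + 1e-4 * entropy.detach()   # entropy added to the reward, weight 0.0001
loss = -log_prob * reward                   # REINFORCE loss for the policy parameters
loss.backward()                             # gradients flow back into the logits
```

In ENAS the controller emits a sequence of such decisions per sampled architecture and their log-probabilities are combined; the sketch covers a single decision for brevity.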