Learning to Mutate with Hypergradient Guided Population

Authors: Zhiqiang Tao, Yaliang Li, Bolin Ding, Ce Zhang, Jingren Zhou, Yun Fu

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evidence on synthetic functions is provided to show that HPM outperforms hypergradient significantly. Experiments on two benchmark datasets are also conducted to validate the effectiveness of the proposed HPM algorithm for training deep neural networks compared with several strong baselines.
Researcher Affiliation | Collaboration | Zhiqiang Tao (1,4), Yaliang Li (2), Bolin Ding (2), Ce Zhang (3), Jingren Zhou (2), Yun Fu (4). 1: Department of Computer Science and Engineering, Santa Clara University; 2: Alibaba Group; 3: Department of Computer Science, ETH Zürich; 4: Department of Electrical & Computer Engineering, Northeastern University.
Pseudocode | Yes | Algorithm 1 summarizes the entire HPM scheduling algorithm.
Open Source Code | No | All the code on the benchmark datasets was implemented with the PyTorch library.
Open Datasets | Yes | We tune 15 hyperparameters, including 8 dropout rates and 7 data augmentation hyperparameters for AlexNet [20] on the CIFAR-10 image dataset [19], and 7 RNN regularization hyperparameters [13, 33, 29] for the LSTM [15] model on the Penn Treebank (PTB) [28] corpus dataset.
Dataset Splits | Yes | Train step updates $\theta^k_{t-1}$ to $\theta^k_t$ and evaluates the validation loss $\mathcal{L}_{\text{val}}(\theta^k_t, h^k_t)$ for each k. One training step could be one epoch or a fixed number of iterations. An agent model is ready to be exploited and explored after one step (see the scheduling-round sketch below the table).
Hardware Specification | No | For all the hyperparameter schedule methods, we ran the experiments in the same computing environment.
Software Dependencies | No | All the code on the benchmark datasets was implemented with the PyTorch library.
Experiment Setup | Yes | We set the population size as 20 and the truncation selection ratio as 20% for PBT, HPM w/o T, and HPM. We employed the recommended optimizers and learning rates for all the baseline networks and STN models following [25]. Our teacher network was implemented with 64 key slots and was trained with the Adam optimizer with a learning rate of 0.001. For the fixed hyperparameter methods, we used the Hyperband [22] implementation provided in [23] and posted the results of the others reported in [25]. For all the hyperparameter schedule methods, we ran the experiments in the same computing environment. STN usually converges within 250 (150) epochs on the CIFAR-10 (PTB) dataset. Thus, we set T as 250 and 150 for all the population-based methods on CIFAR-10 and PTB, respectively (these settings are collected in the configuration sketch below the table).
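
The Pseudocode, Dataset Splits, and Experiment Setup rows together describe one HPM-style scheduling round: each agent in a population of 20 takes one training step, its validation loss is evaluated, and the agent is then exploited (truncation selection at a 20% ratio) and explored (mutation of its hyperparameters). Below is a minimal Python sketch of such a round under those assumptions; the function names, the toy parameter update, and the random perturbation that stands in for the paper's teacher-predicted, hypergradient-guided mutation are all hypothetical and not taken from the authors' code.

import copy
import random

# Minimal sketch of one population scheduling round in the spirit of HPM.
# All names below (train_one_step, validation_loss, mutate, the agent dict
# layout) are illustrative placeholders, not the authors' implementation.

POPULATION_SIZE = 20      # population size quoted in the Experiment Setup row
TRUNCATION_RATIO = 0.20   # truncation selection ratio quoted above


def train_one_step(agent):
    # Placeholder for one training step (one epoch or a fixed number of iterations).
    agent["theta"] = [w - 0.01 for w in agent["theta"]]


def validation_loss(agent):
    # Placeholder for L_val(theta_t^k, h_t^k); a random score stands in for it here.
    return random.random()


def mutate(hyperparams):
    # Stand-in for the hypergradient-guided mutation predicted by the teacher
    # network in the paper; here it is only a random perturbation.
    return {name: value * random.uniform(0.8, 1.2) for name, value in hyperparams.items()}


def scheduling_round(population):
    # 1) Each agent k takes one training step and evaluates its validation loss.
    for agent in population:
        train_one_step(agent)
        agent["val_loss"] = validation_loss(agent)

    # 2) Exploit: truncation selection, where the bottom agents copy the top agents.
    population.sort(key=lambda a: a["val_loss"])
    n_trunc = max(1, int(TRUNCATION_RATIO * len(population)))
    for loser, winner in zip(population[-n_trunc:], population[:n_trunc]):
        loser["theta"] = copy.deepcopy(winner["theta"])
        loser["h"] = dict(winner["h"])

    # 3) Explore: mutate the hyperparameters of the agents that were replaced.
    for agent in population[-n_trunc:]:
        agent["h"] = mutate(agent["h"])


population = [{"theta": [0.0] * 4, "h": {"dropout": 0.5, "lr": 0.1}} for _ in range(POPULATION_SIZE)]
scheduling_round(population)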
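
For reference, the quantitative settings quoted in the Open Datasets and Experiment Setup rows can be gathered into one configuration. The dictionary below is a hypothetical summary; its key names are illustrative and not part of the released implementation.

# Hypothetical configuration dictionary collecting the settings quoted in the
# Open Datasets and Experiment Setup rows; the key names are illustrative only
# and do not come from the released code.
HPM_EXPERIMENT_CONFIG = {
    "population_size": 20,
    "truncation_ratio": 0.20,
    "teacher_network": {"key_slots": 64, "optimizer": "Adam", "learning_rate": 1e-3},
    "cifar10": {
        "model": "AlexNet",
        "num_hyperparameters": 15,   # 8 dropout rates + 7 data augmentation
        "schedule_horizon_T": 250,   # epochs; STN converges within ~250 on CIFAR-10
    },
    "ptb": {
        "model": "LSTM",
        "num_hyperparameters": 7,    # RNN regularization hyperparameters
        "schedule_horizon_T": 150,   # epochs; STN converges within ~150 on PTB
    },
}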