Learning to Mutate with Hypergradient Guided Population

Authors: Zhiqiang Tao, Yaliang Li, Bolin Ding, Ce Zhang, Jingren Zhou, Yun Fu

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evidence on synthetic functions is provided to show that HPM outperforms hypergradient significantly. Experiments on two benchmark datasets are also conducted to validate the effectiveness of the proposed HPM algorithm for training deep neural networks compared with several strong baselines.
Researcher Affiliation | Collaboration | Zhiqiang Tao (1,4), Yaliang Li (2), Bolin Ding (2), Ce Zhang (3), Jingren Zhou (2), Yun Fu (4). 1: Department of Computer Science and Engineering, Santa Clara University; 2: Alibaba Group; 3: Department of Computer Science, ETH Zürich; 4: Department of Electrical & Computer Engineering, Northeastern University.
Pseudocode | Yes | Algorithm 1 summarizes the entire HPM scheduling algorithm.
Open Source Code | No | All the code on the benchmark datasets was implemented with the PyTorch library.
Open Datasets | Yes | We tune 15 hyperparameters, including 8 dropout rates and 7 data augmentation hyperparameters for AlexNet [20] on the CIFAR-10 image dataset [19], and 7 RNN regularization hyperparameters [13, 33, 29] for the LSTM [15] model on the Penn Treebank (PTB) [28] corpus dataset.
Dataset Splits | Yes | Train step updates $\theta^k_{t-1}$ to $\theta^k_t$ and evaluates the validation loss $\mathcal{L}_{\text{val}}(\theta^k_t, h^k_t)$ for each k. One training step could be one epoch or a fixed number of iterations. An agent model is ready to be exploited and explored after one step (see the scheduling-round sketch below the table).
Hardware Specification | No | For all the hyperparameter schedule methods, we ran the experiments in the same computing environment.
Software Dependencies | No | All the code on the benchmark datasets was implemented with the PyTorch library.
Experiment Setup | Yes | We set the population size as 20 and the truncation selection ratio as 20% for PBT, HPM w/o T, and HPM. We employed the recommended optimizers and learning rates for all the baseline networks and STN models following [25]. Our teacher network was implemented with 64 key slots and was trained with the Adam optimizer with a learning rate of 0.001. For the fixed hyperparameter methods, we used the Hyperband [22] implementation provided in [23] and posted the results of the others reported in [25]. For all the hyperparameter schedule methods, we ran the experiments in the same computing environment. STN usually converges within 250 (150) epochs on the CIFAR-10 (PTB) dataset. Thus, we set T as 250 and 150 for all the population-based methods on CIFAR-10 and PTB, respectively (these settings are collected in the configuration sketch below the table).
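
The Pseudocode, Dataset Splits, and Experiment Setup rows together describe one HPM-style scheduling round: each agent in a population of 20 takes one training step, its validation loss is evaluated, and the agent is then exploited (truncation selection at a 20% ratio) and explored (mutation of its hyperparameters). Below is a minimal Python sketch of such a round under those assumptions; the function names, the toy parameter update, and the random perturbation that stands in for the paper's teacher-predicted, hypergradient-guided mutation are all hypothetical and not taken from the authors' code.

import copy
import random

# Minimal sketch of one population scheduling round in the spirit of HPM.
# All names below (train_one_step, validation_loss, mutate, the agent dict
# layout) are illustrative placeholders, not the authors' implementation.

POPULATION_SIZE = 20      # population size quoted in the Experiment Setup row
TRUNCATION_RATIO = 0.20   # truncation selection ratio quoted above


def train_one_step(agent):
    # Placeholder for one training step (one epoch or a fixed number of iterations).
    agent["theta"] = [w - 0.01 for w in agent["theta"]]


def validation_loss(agent):
    # Placeholder for L_val(theta_t^k, h_t^k); a random score stands in for it here.
    return random.random()


def mutate(hyperparams):
    # Stand-in for the hypergradient-guided mutation predicted by the teacher
    # network in the paper; here it is only a random perturbation.
    return {name: value * random.uniform(0.8, 1.2) for name, value in hyperparams.items()}


def scheduling_round(population):
    # 1) Each agent k takes one training step and evaluates its validation loss.
    for agent in population:
        train_one_step(agent)
        agent["val_loss"] = validation_loss(agent)

    # 2) Exploit: truncation selection, where the bottom agents copy the top agents.
    population.sort(key=lambda a: a["val_loss"])
    n_trunc = max(1, int(TRUNCATION_RATIO * len(population)))
    for loser, winner in zip(population[-n_trunc:], population[:n_trunc]):
        loser["theta"] = copy.deepcopy(winner["theta"])
        loser["h"] = dict(winner["h"])

    # 3) Explore: mutate the hyperparameters of the agents that were replaced.
    for agent in population[-n_trunc:]:
        agent["h"] = mutate(agent["h"])


population = [{"theta": [0.0] * 4, "h": {"dropout": 0.5, "lr": 0.1}} for _ in range(POPULATION_SIZE)]
scheduling_round(population)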
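
For reference, the quantitative settings quoted in the Open Datasets and Experiment Setup rows can be gathered into one configuration. The dictionary below is a hypothetical summary; its key names are illustrative and not part of the released implementation.

# Hypothetical configuration dictionary collecting the settings quoted in the
# Open Datasets and Experiment Setup rows; the key names are illustrative only
# and do not come from the released code.
HPM_EXPERIMENT_CONFIG = {
    "population_size": 20,
    "truncation_ratio": 0.20,
    "teacher_network": {"key_slots": 64, "optimizer": "Adam", "learning_rate": 1e-3},
    "cifar10": {
        "model": "AlexNet",
        "num_hyperparameters": 15,   # 8 dropout rates + 7 data augmentation
        "schedule_horizon_T": 250,   # epochs; STN converges within ~250 on CIFAR-10
    },
    "ptb": {
        "model": "LSTM",
        "num_hyperparameters": 7,    # RNN regularization hyperparameters
        "schedule_horizon_T": 150,   # epochs; STN converges within ~150 on PTB
    },
}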