HyperAdam: A Learnable Task-Adaptive Adam for Network Training

Authors: Shipeng Wang, Jian Sun, Zongben Xu (pp. 5297-5304)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | It is justified to be state-of-the-art for various network training, such as multilayer perceptron, CNN and LSTM. [...] Second, extensive experiments justify that the learned HyperAdam outperforms traditional optimizers, such as Adam, and learning-based optimizers for training a wide range of neural networks, e.g., deep MLP, CNN, LSTM. [...] (Section 5, Evaluation) We have trained HyperAdam based on a 1-layer MLP (basic MLP); we now evaluate the learned HyperAdam on more complex networks such as the basic MLP with different activation functions, deeper MLP, CNN and LSTM.
Researcher Affiliation | Academia | Shipeng Wang, Jian Sun, Zongben Xu, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, China; wangshipeng8128@stu.xjtu.edu.cn, {jiansun, zbxu}@xjtu.edu.cn
Pseudocode | Yes | Algorithm 1 (Adam Optimizer) [...] Algorithm 2 (Task-Adaptive HyperAdam). A minimal sketch of the standard Adam update referenced by Algorithm 1 is given after this table.
Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology, nor does it include links to a code repository.
Open Datasets | Yes | We set the learning rate α = 0.005 and maximal iteration T = 100, indicating the number of optimization steps using HyperAdam as an optimizer. The number of candidate updates J is set to 20. HyperAdam can be seen as a recurrent neural network iteratively updating network parameters. Therefore we can optimize the parameter Θ of HyperAdam using Back Propagation Through Time (Werbos 1990) by minimizing L(Θ) with Adam, and the expectation with respect to L is approximated by the average training loss for learner f with different initializations. The T = 100 steps are split into 5 periods of 20 steps to avoid gradient vanishing. In each period, the initial parameter w_0 and initial hidden state H are initialized from the last period, or generated if it is the first period. Two training tricks proposed in (Lv, Jiang, and Li 2017) are used here. First, in order to make the training easier, a k-dimensional convex function $h(z) = \frac{1}{k}\|z - \eta\|^2$ is combined with the original optimizee (i.e., training loss); this trick is called Combination with Convex Function (CC). η and the initial value of z are generated randomly. Second, Random Scaling (RS), which helps avoid over-fitting, randomly samples vectors c_1 and c_2 of the same dimension as parameters w and z respectively, and then multiplies the parameters by c_1 and c_2 coordinate-wise, so the optimizee in the meta-train set becomes $L_{ext}(w, z) = L(f(X; c_1 \odot w), Y) + h(c_2 \odot z)$ (Eq. 9), with initial parameters $\mathrm{diag}(c_1)^{-1} w$ and $\mathrm{diag}(c_2)^{-1} z$. [...] The learner f is simply taken as a forward neural network with one hidden layer of 20 units and sigmoid as activation function. The optimizee L is defined as $L(f(X; w), Y) = \sum_{i=1}^{N} l(f(x_i; w), y_i)$, where l is the cross-entropy loss for the learner f with a minibatch of 128 random images sampled from the MNIST dataset (LeCun et al. 1998). A sketch of the CC and RS tricks is given after this table.
Dataset Splits | No | The paper mentions using the MNIST and CIFAR-10 datasets and describes training procedures and batch sizes, but it does not specify the train/validation/test split percentages or sample counts for these datasets. It refers to a "minibatch of 128 random images sampled from the MNIST dataset" for training, but no clear validation split is defined.
Hardware Specification | No | The paper mentions computation times (e.g., "0.0023s, 0.0033s and 0.0039s respectively in average") but does not specify any hardware details such as CPU or GPU models, or the memory used for the experiments.
Software Dependencies | No | The paper states that the method is "implemented by TensorFlow" and mentions setting "hyperparameters as defaults in TensorFlow", but it does not provide specific version numbers for TensorFlow or any other software dependencies.
Experiment Setup | Yes | We set the learning rate α = 0.005 and maximal iteration T = 100, indicating the number of optimization steps using HyperAdam as an optimizer. The number of candidate updates J is set to 20. [...] The learner f is simply taken as a forward neural network with one hidden layer of 20 units and sigmoid as activation function. The optimizee L is defined as $L(f(X; w), Y) = \sum_{i=1}^{N} l(f(x_i; w), y_i)$, where l is the cross-entropy loss for the learner f with a minibatch of 128 random images sampled from the MNIST dataset (LeCun et al. 1998). A sketch of this learner/optimizee setup is given below the table.
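
The Pseudocode row above refers to Algorithm 1, the standard Adam optimizer (Kingma and Ba). For reference, here is a minimal NumPy sketch of that update rule with the usual default hyperparameters; the function and variable names are ours, not the paper's, and this is not the authors' implementation.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update (Kingma & Ba); a reference sketch, not the paper's code."""
    m = beta1 * m + (1 - beta1) * grad          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment (t >= 1)
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```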
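
The two training tricks quoted in the Open Datasets row, Combination with Convex Function (CC) and Random Scaling (RS), can be summarized with a small sketch of Eq. (9). This is our own illustration: `loss_fn`, `make_extended_optimizee`, and the log-normal sampling of c_1 and c_2 are assumptions for readability, since the paper only says these vectors are sampled randomly.

```python
import numpy as np

def make_extended_optimizee(loss_fn, w_dim, z_dim, rng=np.random.default_rng(0)):
    """Sketch of the CC and RS tricks from Lv, Jiang, and Li (2017), as quoted above.

    loss_fn(w) stands for the original optimizee L(f(X; w), Y); all names are illustrative.
    """
    eta = rng.normal(size=z_dim)      # random target for the convex term (paper: generated randomly)
    c1 = rng.lognormal(size=w_dim)    # random positive scaling for w (RS); distribution is our assumption
    c2 = rng.lognormal(size=z_dim)    # random positive scaling for z (RS); distribution is our assumption

    def h(z):
        # Combination with Convex Function (CC): h(z) = (1/k) * ||z - eta||^2, with k = z_dim
        return np.sum((z - eta) ** 2) / z_dim

    def extended_loss(w, z):
        # Eq. (9): L_ext(w, z) = L(f(X; c1 ⊙ w), Y) + h(c2 ⊙ z)
        return loss_fn(c1 * w) + h(c2 * z)

    def rescale_init(w, z):
        # Initial parameters are diag(c1)^-1 w and diag(c2)^-1 z, so that the scaled
        # parameters seen by the loss match the original initialization.
        return w / c1, z / c2

    return extended_loss, rescale_init
```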
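
The learner/optimizee quoted in the Experiment Setup row (a one-hidden-layer MLP with 20 sigmoid units trained with cross-entropy on minibatches of 128 MNIST images) can be written down concretely as follows. This is a sketch under our own assumptions: the flat parameter vector layout, the 784-dimensional input, the 10 output classes, and the function names are ours, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def optimizee_loss(w, X, Y, hidden=20, n_classes=10):
    """Cross-entropy loss of a one-hidden-layer sigmoid MLP, L(f(X; w), Y) = sum_i l(f(x_i; w), y_i).

    w is a flat parameter vector; X is a (128, 784) minibatch of MNIST images and
    Y the matching one-hot labels. The shapes and unpacking scheme are assumptions.
    """
    d = X.shape[1]
    W1 = w[: d * hidden].reshape(d, hidden)
    b1 = w[d * hidden : d * hidden + hidden]
    off = d * hidden + hidden
    W2 = w[off : off + hidden * n_classes].reshape(hidden, n_classes)
    b2 = w[off + hidden * n_classes :]

    h = sigmoid(X @ W1 + b1)                       # hidden layer of 20 sigmoid units
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.sum(Y * log_probs)                  # sum of per-example cross-entropy over the minibatch
```

In a meta-training loop of the kind described in the Open Datasets row, this loss would be minimized for T = 100 steps by the learned optimizer, with the 100 steps split into 5 periods of 20 steps for truncated backpropagation through time.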