Training Stronger Baselines for Learning to Optimize

Authors: Tianlong Chen, Weiyi Zhang, Jingyang Zhou, Shiyu Chang, Sijia Liu, Lisa Amini, Zhangyang Wang

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our improved training techniques with a variety of state-of-the-art L2O models and immediately boost their performance, without making any change to their model structures. We demonstrate that, using our improved training techniques, one of the earliest and simplest L2O models [1] can be trained to outperform even the latest and most complex L2O models on a number of tasks. Our results demonstrate a greater potential of L2O yet to be unleashed, and prompt a reconsideration of recent L2O model progress. Our codes are publicly available at: https://github.com/VITA-Group/L2O-Training-Techniques.
Researcher Affiliation | Collaboration | (1) University of Texas at Austin, (2) Shanghai Jiao Tong University, (3) University of Science and Technology of China, (4) MIT-IBM Watson AI Lab, IBM Research
Pseudocode | Yes | Algorithm 1: Curriculum Learning for Training a Learnable Optimizer (L2O); Algorithm 2: Imitation Learning for L2O
Open Source Code | Yes | Our codes are publicly available at: https://github.com/VITA-Group/L2O-Training-Techniques.
Open Datasets | Yes | We train the optimizer on the same single optimizee as in [1], which uses the cross-entropy loss on top of a simple Multi-layer Perceptron (MLP) with one hidden layer of 20 dimensions and the sigmoid activation function, on the MNIST dataset. ... Conv-CIFAR: the above CNN trained on CIFAR-10 ... NAS-CIFAR is taken from the popular NAS-Bench-201 search space [43]. (See the optimizee sketch after the table.)
Dataset Splits | Yes | Validation also uses the same optimizee as in [1]. We progressively increase N_train and N_valid during training, until the validation loss stops decreasing. ... L_val = validation loss with φ and N^(i)_valid; if L_min > L_val then L_min = L_val, φ* = φ, stop = False; ... Before starting training with N^(i+1)_train, we first validate the previous best model with N^(i+1)_valid as the baseline validation loss of the (i+1)-th training stage. (See the curriculum-loop sketch after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions optimizers like Adam and RMSProp but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | L2O models are trained by the default meta-optimizers with the best hyperparameters provided by each baseline: L2O-DM [1] and RNNprop [7] are optimized by Adam with an initial learning rate of 1e-3; L2O-Scale [8] and its variants are optimized by RMSProp with an initial learning rate of 1e-6. The optimizee parameters are initialized from a random normal distribution with standard deviation 0.01. The batch size for optimizees is 128. The unroll length is fixed to 20, except for the meta-learning baseline of L2O-Scale, where both the number of optimization steps and the unroll lengths are sampled from long-tail distributions [8]. (See the configuration sketch after the table.)
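
The Open Datasets row quotes the paper's simplest optimizee: a one-hidden-layer MLP with 20 sigmoid units trained with cross-entropy on MNIST. Below is a minimal PyTorch sketch of that optimizee, assuming standard 28x28 MNIST inputs and 10 classes; the class name and layer layout are our own reconstruction, not the released code.

```python
import torch
import torch.nn as nn

class MNISTOptimizee(nn.Module):
    """One-hidden-layer MLP optimizee, as quoted in the Open Datasets
    row: 20 hidden units, sigmoid activation, cross-entropy on MNIST.
    Names are illustrative; see the authors' repository for the exact model."""

    def __init__(self, hidden_dim: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                   # 28x28 images -> 784 features
            nn.Linear(28 * 28, hidden_dim),
            nn.Sigmoid(),                   # sigmoid activation per the paper
            nn.Linear(hidden_dim, 10),      # 10 MNIST classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

loss_fn = nn.CrossEntropyLoss()             # cross-entropy loss per the paper
```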
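The Dataset Splits quote outlines the curriculum schedule of Algorithm 1: the unroll lengths N_train and N_valid grow stage by stage, each stage begins by re-validating the previous best model at the longer validation horizon, and training ends when the validation loss stops decreasing. A minimal sketch of that loop follows, assuming a PyTorch-style L2O model; `train_one_epoch`, `validate`, `stages`, and `epochs_per_stage` are hypothetical stand-ins for the paper's actual procedures.

```python
import copy

def curriculum_train(l2o, stages, epochs_per_stage, train_one_epoch, validate):
    """Hedged sketch of the curriculum loop described in the Dataset
    Splits quote. `stages` is a list of (n_train, n_valid) unroll
    lengths that increase stage by stage."""
    best = copy.deepcopy(l2o.state_dict())
    for n_train, n_valid in stages:
        # Validate the previous best model with the new N^(i+1)_valid
        # as the baseline validation loss of stage i+1.
        l2o.load_state_dict(best)
        l_min = validate(l2o, n_valid)
        improved = False
        for _ in range(epochs_per_stage):
            train_one_epoch(l2o, n_train)    # unroll N^(i)_train steps
            l_val = validate(l2o, n_valid)   # unroll N^(i)_valid steps
            if l_val < l_min:                # L_min > L_val: keep checkpoint
                l_min = l_val
                best = copy.deepcopy(l2o.state_dict())
                improved = True
        if not improved:
            # Validation loss stopped decreasing: end the curriculum.
            break
    l2o.load_state_dict(best)
    return l2o
```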
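The Experiment Setup row lists concrete meta-training hyperparameters. The sketch below encodes them, assuming PyTorch optimizers; the helper names and constants are ours, not the authors' configuration files.

```python
import torch

def make_meta_optimizer(baseline: str, params):
    """Hedged mapping from baseline name to meta-optimizer, per the
    quoted hyperparameters; the function name is our own."""
    if baseline in ("L2O-DM", "RNNprop"):
        return torch.optim.Adam(params, lr=1e-3)      # Adam, initial lr 1e-3
    if baseline == "L2O-Scale":
        return torch.optim.RMSprop(params, lr=1e-6)   # RMSProp, initial lr 1e-6
    raise ValueError(f"unknown baseline: {baseline}")

BATCH_SIZE = 128   # optimizee batch size (quoted)
UNROLL_LEN = 20    # fixed unroll length; L2O-Scale samples it instead

def init_optimizee_params(module: torch.nn.Module) -> None:
    # Optimizee parameters drawn from a random normal distribution
    # with standard deviation 0.01, as quoted.
    for p in module.parameters():
        torch.nn.init.normal_(p, mean=0.0, std=0.01)
```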