Training Stronger Baselines for Learning to Optimize
Authors: Tianlong Chen, Weiyi Zhang, Jingyang Zhou, Shiyu Chang, Sijia Liu, Lisa Amini, Zhangyang Wang
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our improved training techniques with a variety of state-of-the-art L2O models and immediately boost their performance, without making any change to their model structures. We demonstrate that, using our improved training techniques, one of the earliest and simplest L2O models [1] can be trained to outperform even the latest and most complex L2O models on a number of tasks. Our results demonstrate a greater potential of L2O yet to be unleashed, and prompt a reconsideration of recent L2O model progress. Our codes are publicly available at: https://github.com/VITA-Group/L2O-Training-Techniques. |
| Researcher Affiliation | Collaboration | 1University of Texas at Austin, 2Shanghai Jiao Tong University, 3University of Science and Technology of China, 4MIT-IBM Watson AI Lab, IBM Research |
| Pseudocode | Yes | Algorithm 1: Curriculum Learning for Training Learnable Optimizer (L2O); Algorithm 2: Imitation Learning for L2O |
| Open Source Code | Yes | Our codes are publicly available at: https://github.com/VITA-Group/L2O-Training-Techniques. |
| Open Datasets | Yes | we train the optimizer on the same single optimizee as in [1], which uses the cross-entropy loss on top of a simple Multi-layer Perceptron (MLP) with one hidden layer of 20 dimensions and the sigmoid activation function on the MNIST dataset. ... Conv-CIFAR: the above CNN trained on CIFAR-10 ... NAS-CIFAR, is taken from the popular NAS-Bench-201 search space [43]. |
| Dataset Splits | Yes | Validation also uses the same optimizee as in [1]. We progressively increase N_train and N_valid during training, until the validation loss stops decreasing. ... L_val = val. loss with φ* and N^(i)_valid; if L_min > L_val then L_min = L_val, φ = φ*, stop = False; ... Before starting training with N^(i+1)_train, we first validate the previous best model with N^(i+1)_valid as the baseline validation loss of the (i+1)-th training stage. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adam and RMSProp but does not provide version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | L2O models are trained by the default meta-optimizers with the best hyperparameters provided by each baseline: L2O-DM [1] and RNNprop [7] are optimized by Adam with an initial learning rate of 1e-3; L2O-Scale [8] and its variants are optimized by RMSProp with an initial learning rate of 1e-6. The optimizee parameters are initialized from a random normal distribution with standard deviation 0.01. The batch size for optimizees is 128. The unroll length is fixed at 20, except for the meta-learning baseline of L2O-Scale, where both the number of optimization steps and the unroll lengths are sampled from long-tail distributions [8]. |
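The curriculum procedure quoted under "Dataset Splits" (grow N_train and N_valid each stage, re-baseline the previous best model at the longer validation horizon, and stop once the validation loss no longer decreases) can be sketched as follows. This is a toy stand-in under assumed names (`train_l2o`, `val_loss`, a scalar parameter `phi`), not the authors' implementation:

```python
def train_l2o(phi, n_train):
    # Toy "training": nudge the scalar L2O parameter toward 0.
    return phi * 0.5

def val_loss(phi, n_valid):
    # Toy validation loss: shrinks with |phi|, grows with longer horizons.
    return abs(phi) + 0.01 * n_valid

def curriculum_train(phi, n0=20, growth=2, max_stages=10):
    """Sketch of Algorithm 1's curriculum: grow horizons while improving."""
    best_phi, stage = phi, 0
    n_train = n_valid = n0
    # Baseline validation loss at the initial horizon.
    l_min = val_loss(best_phi, n_valid)
    while stage < max_stages:
        candidate = train_l2o(best_phi, n_train)
        l_val = val_loss(candidate, n_valid)
        if l_val < l_min:                  # still improving at this horizon
            best_phi = candidate
            n_train *= growth              # grow the curriculum
            n_valid *= growth
            # Re-baseline the best model at the longer validation horizon.
            l_min = val_loss(best_phi, n_valid)
            stage += 1
        else:                              # validation loss stopped decreasing
            break
    return best_phi, n_train

best, horizon = curriculum_train(phi=1.0)
```

In this toy setup every stage improves, so the loop runs to `max_stages`; with a real optimizee the early stop on a non-decreasing validation loss is what terminates training.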
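Algorithm 2 (imitation learning) additionally trains the learned optimizer to mimic the updates of analytic optimizers. A minimal sketch, assuming a scalar learned step size and plain gradient descent as the "teacher" (the paper imitates full analytic optimizers such as Adam, and these function names are illustrative):

```python
def analytic_update(grad, lr=0.1):
    # Stand-in "teacher" step (plain gradient descent, not Adam).
    return -lr * grad

def learned_update(grad, phi):
    # Toy "learned" optimizer: a single learnable step size phi.
    return -phi * grad

def imitation_loss(grads, phi):
    # Mean squared gap between learned and teacher updates.
    return sum((learned_update(g, phi) - analytic_update(g)) ** 2
               for g in grads) / len(grads)

# Fit the learned step size to imitate the teacher on sampled gradients.
phi, lr_meta = 0.5, 0.05
grads = [1.0, -2.0, 0.5, 3.0]
for _ in range(200):
    # Analytic gradient of the imitation loss w.r.t. phi.
    g_phi = sum(-2 * g * (learned_update(g, phi) - analytic_update(g))
                for g in grads) / len(grads)
    phi -= lr_meta * g_phi
```

Here `phi` converges to the teacher's step size (0.1); in the paper the imitation term is combined with the usual optimizee-loss objective rather than used alone.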
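The hyperparameters quoted under "Experiment Setup" can be collected into a small config sketch; the dictionary layout and key names are illustrative assumptions, only the values come from the paper:

```python
# Meta-optimizer settings per baseline, as quoted from the paper.
META_OPTIMIZERS = {
    "L2O-DM":    {"optimizer": "Adam",    "lr": 1e-3},
    "RNNprop":   {"optimizer": "Adam",    "lr": 1e-3},
    "L2O-Scale": {"optimizer": "RMSProp", "lr": 1e-6},
}

# Shared optimizee settings.
OPTIMIZEE = {
    "init_std": 0.01,      # weights drawn from Normal(0, 0.01)
    "batch_size": 128,
    "unroll_length": 20,   # fixed, except L2O-Scale's long-tail sampling
}
```

Keeping these in one place makes it easy to check a reproduction against the reported setup baseline by baseline.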