Optimizer Amalgamation

Authors: Tianshu Huang, Tianlong Chen, Sijia Liu, Shiyu Chang, Lisa Amini, Zhangyang Wang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we present experiments showing the superiority of our amalgamated optimizer compared to its amalgamated components and learning to optimize baselines, and the efficacy of our variance reducing perturbations. Our code and pre-trained models are publicly available at http://github.com/VITA-Group/Optimizer Amalgamation.
Researcher Affiliation | Collaboration | Tianshu Huang (1,2), Tianlong Chen (1), Sijia Liu (3), Shiyu Chang (4), Lisa Amini (5), Zhangyang Wang (1); 1: University of Texas at Austin, 2: Carnegie Mellon University, 3: Michigan State University, 4: University of California, Santa Barbara, 5: MIT-IBM Watson AI Lab, IBM Research
Pseudocode | Yes | Algorithm 1: Distillation by Truncated Back-propagation; Algorithm 2: Adversarial Weight Perturbation for Truncated Back-propagation. (A minimal Python sketch of the Algorithm 1 distillation loop appears after this table.)
Open Source Code | Yes | Our code and pre-trained models are publicly available at http://github.com/VITA-Group/Optimizer Amalgamation.
Open Datasets | Yes | All datasets were accessed using TensorFlow Datasets and have a CC-BY 4.0 license. The MNIST dataset (LeCun & Cortes, 2010) is used during training; the other datasets, from most to least similar to MNIST, are: FMNIST: Fashion-MNIST (Xiao et al., 2017); SVHN: Street View House Numbers, cropped (Netzer et al., 2011); CIFAR-10 (Krizhevsky et al., 2009). (See the TensorFlow Datasets loading sketch after this table.)
Dataset Splits | No | The selection criterion is the best validation loss after 5 epochs for the Train network on MNIST, which matches the meta-training settings of the amalgamated optimizer. No specific percentages or sample counts for training/validation splits were explicitly provided.
Hardware Specification | Yes | All experiments were run on single nodes with 4x Nvidia 1080ti GPUs, providing us with a meta-batch size of 4 simultaneous optimizations.
Software Dependencies | No | The paper mentions using "TensorFlow Datasets" but does not specify version numbers for TensorFlow or any other software libraries used, which are required for reproducibility.
Experiment Setup | Yes | The RNNProp amalgamation target was trained using truncated backpropagation through time with a constant truncation length of 100 steps and a total unroll of up to 1000 steps, and was meta-optimized by Adam with a learning rate of 1 × 10^-3. For our training process, we also apply random scaling (Lv et al., 2017) and curriculum learning (Chen et al., 2020a); more details about amalgamation training are provided in Appendix C.3. During training, a batch size of 128 is used except for the Small Batch evaluation, which has a batch size of 32. The SGD learning rate is fixed at 0.01. Warmup: instead of initializing each training optimizee with random weights, we first apply 100 steps of SGD optimization as a warmup. (These settings are collected in the configuration sketch after this table.)
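
The Pseudocode row references Algorithm 1 (Distillation by Truncated Back-propagation). Below is a minimal sketch of that idea, not the paper's implementation: it assumes a toy least-squares optimizee in place of the Train network, an MLP student in place of RNNProp, a single Adam teacher, a first-order approximation through the optimizee gradients, and an illustrative squared-distance imitation term. The names `policy`, `teacher_step`, and `amalgamate` are hypothetical; only the truncation length (100), total unroll (1000), and meta-optimizer (Adam, lr 1e-3) follow the reported setup.

```python
import tensorflow as tf

# Student "learned optimizer": a per-parameter MLP mapping gradients to updates.
# (The paper's student is RNNProp, an LSTM-based optimizer; an MLP keeps the sketch short.)
policy = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
policy(tf.zeros((1, 1)))                    # build variables before taping
meta_opt = tf.keras.optimizers.Adam(1e-3)   # meta-optimizer (Adam, lr 1e-3)


def optimizee_loss(w, x, y):
    # Toy least-squares optimizee standing in for the paper's MLP on MNIST.
    return tf.reduce_mean((tf.linalg.matvec(x, w) - y) ** 2)


def grad_of(w, x, y):
    # Gradient of the optimizee loss at parameters w.
    with tf.GradientTape() as g:
        g.watch(w)
        loss = optimizee_loss(w, x, y)
    return g.gradient(loss, w)


def teacher_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Analytical teacher (Adam) whose behaviour is distilled into the student.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    mhat = m / (1 - tf.pow(b1, t))
    vhat = v / (1 - tf.pow(b2, t))
    return w - lr * mhat / (tf.sqrt(vhat) + eps), m, v


def amalgamate(x, y, unroll=100, total=1000, imitation_weight=1.0):
    dim = x.shape[1]
    w_s = tf.zeros(dim)                     # student-optimized parameters
    w_t = tf.zeros(dim)                     # teacher-optimized parameters
    m = tf.zeros(dim)
    v = tf.zeros(dim)
    for start in range(0, total, unroll):   # truncation length: 100 steps
        with tf.GradientTape() as tape:
            meta_loss = 0.0
            for k in range(unroll):
                t = tf.cast(start + k + 1, tf.float32)
                # First-order approximation: stop gradients through the
                # optimizee gradient, as is common in learning to optimize.
                g_s = tf.stop_gradient(grad_of(w_s, x, y))
                update = tf.squeeze(policy(tf.reshape(g_s, (-1, 1))), axis=-1)
                w_s = w_s + update
                w_t, m, v = teacher_step(
                    w_t, tf.stop_gradient(grad_of(w_t, x, y)), m, v, t)
                # Meta-loss: optimizee loss plus an illustrative imitation term
                # pulling the student trajectory toward the teacher's.
                meta_loss += optimizee_loss(w_s, x, y)
                meta_loss += imitation_weight * tf.reduce_sum(
                    (w_s - tf.stop_gradient(w_t)) ** 2)
        grads = tape.gradient(meta_loss, policy.trainable_variables)
        meta_opt.apply_gradients(zip(grads, policy.trainable_variables))
        # Truncate: detach all carried state before the next unroll window.
        w_s, w_t, m, v = (tf.stop_gradient(z) for z in (w_s, w_t, m, v))
    return policy


# Usage on a synthetic problem:
x = tf.random.normal((128, 8))
y = tf.random.normal((128,))
amalgamate(x, y)
```

The paper additionally amalgamates several analytical optimizers at once and applies variance-reducing weight perturbations (Algorithm 2), neither of which is shown here.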
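The Open Datasets row states that all datasets were accessed through TensorFlow Datasets. The following loading sketch assumes the standard TFDS catalog names and splits (the paper does not state split percentages); the `load` helper and its preprocessing are illustrative, not the repository's pipeline.

```python
import tensorflow as tf
import tensorflow_datasets as tfds


def load(name, split, batch_size=128):
    # batch_size=128 during training; 32 for the "Small Batch" evaluation.
    ds = tfds.load(name, split=split, as_supervised=True)
    ds = ds.map(lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))
    return ds.shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)


# MNIST is used for meta-training; the transfer sets, from most to least
# similar to MNIST, are Fashion-MNIST, SVHN (cropped), and CIFAR-10.
train_ds = load("mnist", "train")
transfer = {name: load(name, "test")
            for name in ("fashion_mnist", "svhn_cropped", "cifar10")}
```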
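The Experiment Setup row's reported hyperparameters, collected in one place as a plain dictionary. The keys are illustrative and do not reflect the repository's actual configuration schema.

```python
# Settings reported in the paper (values only; key names are assumptions).
TRAINING_SETUP = {
    "truncation_length": 100,      # constant truncation length (steps)
    "total_unroll": 1000,          # total unroll of up to 1000 steps
    "meta_optimizer": "adam",
    "meta_learning_rate": 1e-3,
    "random_scaling": True,        # Lv et al., 2017
    "curriculum_learning": True,   # Chen et al., 2020a
    "batch_size": 128,             # 32 for the Small Batch evaluation
    "sgd_learning_rate": 0.01,     # fixed SGD learning rate
    "warmup_steps": 100,           # 100 SGD steps before each training optimizee
    "meta_batch_size": 4,          # 4 simultaneous optimizations (4x 1080ti node)
}
```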