Optimizer Amalgamation
Authors: Tianshu Huang, Tianlong Chen, Sijia Liu, Shiyu Chang, Lisa Amini, Zhangyang Wang
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we present experiments showing the superiority of our amalgamated optimizer compared to its amalgamated components and learning to optimize baselines, and the efficacy of our variance reducing perturbations. Our code and pre-trained models are publicly available at http://github.com/VITA-Group/Optimizer_Amalgamation. |
| Researcher Affiliation | Collaboration | Tianshu Huang1,2, Tianlong Chen1, Sijia Liu3, Shiyu Chang4, Lisa Amini5, Zhangyang Wang1 1University of Texas at Austin, 2Carnegie Mellon University, 3Michigan State University, 4University of California, Santa Barbara, 5MIT-IBM Watson AI Lab, IBM Research |
| Pseudocode | Yes | Algorithm 1: Distillation by Truncated Back-propagation; Algorithm 2: Adversarial Weight Perturbation for Truncated Back-propagation |
| Open Source Code | Yes | Our code and pre-trained models are publicly available at http://github.com/VITA-Group/Optimizer_Amalgamation. |
| Open Datasets | Yes | All datasets were accessed using TensorFlow Datasets and have a CC-BY 4.0 license. The MNIST dataset (LeCun & Cortes, 2010) is used during training; the other datasets, from most to least similar, are: FMNIST: Fashion MNIST (Xiao et al., 2017). SVHN: Street View House Numbers, cropped (Netzer et al., 2011). CIFAR-10 (Krizhevsky et al., 2009). A hedged loading sketch is given after this table. |
| Dataset Splits | No | The selection criterion is the best validation loss after 5 epochs for the Train network on MNIST, which matches the meta-training settings of the amalgamated optimizer. No specific percentages or sample counts for training/validation splits are explicitly provided. |
| Hardware Specification | Yes | All experiments were run on single nodes with 4x Nvidia 1080ti GPUs, providing us with a metabatch size of 4 simultaneous optimizations. |
| Software Dependencies | No | The paper mentions using "TensorFlow Datasets" but does not specify version numbers for TensorFlow or any other software libraries used, which is required for reproducibility. |
| Experiment Setup | Yes | The RNNProp amalgamation target was trained using truncated backpropagation through time with a constant truncation length of 100 steps and a total unroll of up to 1000 steps, and meta-optimized by Adam with a learning rate of 1 × 10⁻³. For our training process, we also apply random scaling (Lv et al., 2017) and curriculum learning (Chen et al., 2020a); more details about amalgamation training are provided in Appendix C.3. During training, a batch size of 128 is used except for the Small Batch evaluation, which has a batch size of 32. The SGD learning rate is fixed at 0.01. Warmup: Instead of initializing each training optimizee with random weights, we first apply 100 steps of SGD optimization as a warmup. A hedged training-loop sketch illustrating these hyperparameters follows this table. |
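The following is a minimal sketch of how the datasets listed in the Open Datasets row could be loaded with TensorFlow Datasets. Only the dataset names and the batch size of 128 (32 for the Small Batch evaluation) come from the paper; the `load_split` helper, the preprocessing, and the shuffle buffer size are illustrative choices, not the authors' pipeline.

```python
# Hedged sketch: loading the evaluation datasets via TensorFlow Datasets.
# Dataset names and batch size follow the paper; everything else is illustrative.
import tensorflow as tf
import tensorflow_datasets as tfds

DATASETS = ["mnist", "fashion_mnist", "svhn_cropped", "cifar10"]

def load_split(name, split="train", batch_size=128):
    ds = tfds.load(name, split=split, as_supervised=True)
    ds = ds.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y),
                num_parallel_calls=tf.data.AUTOTUNE)
    return ds.shuffle(10_000).batch(batch_size).prefetch(tf.data.AUTOTUNE)

train_mnist = load_split("mnist")                                   # meta-training
transfer_sets = {name: load_split(name) for name in DATASETS[1:]}   # transfer evaluation
```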
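The Experiment Setup row quotes the meta-training hyperparameters: truncation length 100, total unroll of up to 1000 steps, Adam meta-optimizer at 1 × 10⁻³, batch size 128, and a 100-step SGD warmup at learning rate 0.01. Below is a minimal sketch of truncated backpropagation through time under those settings. The `TinyLearnedOptimizer` and the two-layer MLP optimizee are stand-ins of our own; the sketch omits the paper's RNNProp architecture, teacher distillation loss, random scaling, curriculum learning, and variance-reducing perturbations.

```python
# Hedged sketch of meta-training with truncated BPTT using the hyperparameters
# quoted above. The learned optimizer and optimizee are simplified stand-ins,
# not the paper's RNNProp target or its pooled teacher optimizers.
import tensorflow as tf
import tensorflow_datasets as tfds

META_LR, SGD_LR = 1e-3, 0.01           # Adam meta-LR; fixed SGD LR for warmup
TRUNCATION, TOTAL_UNROLL = 100, 1000   # truncation length; total inner steps
WARMUP_STEPS, BATCH_SIZE = 100, 128

class TinyLearnedOptimizer(tf.Module):
    """Hypothetical coordinate-wise update rule (stand-in for RNNProp)."""
    def __init__(self, hidden=16):
        self.h = tf.keras.layers.Dense(hidden, activation="relu")
        self.out = tf.keras.layers.Dense(1)
    def __call__(self, grad):
        g = tf.reshape(grad, [-1, 1])
        return tf.reshape(self.out(self.h(g)) * -1e-2, tf.shape(grad))

def init_mlp():
    g = tf.random.Generator.from_seed(0)
    return [g.normal([784, 32]) * 0.05, tf.zeros([32]),
            g.normal([32, 10]) * 0.05, tf.zeros([10])]

def mlp_loss(p, x, y):
    h = tf.nn.relu(x @ p[0] + p[1])
    return tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=y, logits=h @ p[2] + p[3]))

ds = tfds.load("mnist", split="train", as_supervised=True)
ds = ds.map(lambda x, y: (tf.reshape(tf.cast(x, tf.float32) / 255.0, [784]), y))
batches = iter(ds.repeat().batch(BATCH_SIZE))

learned_opt = TinyLearnedOptimizer()
learned_opt(tf.zeros([4]))                       # build variables eagerly
meta_opt = tf.keras.optimizers.Adam(META_LR)
params = init_mlp()

# Warmup: 100 plain SGD steps so the unroll does not start from random weights.
for _ in range(WARMUP_STEPS):
    x, y = next(batches)
    with tf.GradientTape() as tape:
        tape.watch(params)
        loss = mlp_loss(params, x, y)
    grads = tape.gradient(loss, params)
    params = [p - SGD_LR * g for p, g in zip(params, grads)]

# Truncated BPTT: 1000 inner steps split into segments of 100; meta-gradients
# only flow within each truncated segment.
for _ in range(TOTAL_UNROLL // TRUNCATION):
    with tf.GradientTape() as meta_tape:
        seg = [tf.identity(p) for p in params]
        meta_loss = 0.0
        for _ in range(TRUNCATION):
            x, y = next(batches)
            with tf.GradientTape() as inner_tape:
                inner_tape.watch(seg)
                loss = mlp_loss(seg, x, y)
            grads = inner_tape.gradient(loss, seg)
            seg = [p + learned_opt(g) for p, g in zip(seg, grads)]
            # Illustrative meta-loss; the paper distills from teacher optimizers.
            meta_loss += mlp_loss(seg, x, y)
    meta_grads = meta_tape.gradient(meta_loss, learned_opt.trainable_variables)
    meta_opt.apply_gradients(zip(meta_grads, learned_opt.trainable_variables))
    params = [tf.stop_gradient(p) for p in seg]
```

The nested tapes make the inner gradient computation itself differentiable, so the meta-gradient flows through every learned-optimizer update within a segment, while `tf.stop_gradient` cuts the graph between segments, matching the constant truncation length described above.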