MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates

Authors: Mohammad Mozaffari, Sikan Li, Zhao Zhang, Maryam Mehri Dehnavi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that MKOR outperforms state-of-the-art first-order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57x and 1.85x respectively on BERT-Large-Uncased on 64 GPUs.
Researcher Affiliation | Academia | Mohammad Mozaffari, Department of Computer Science, University of Toronto, mmozaffari@cs.toronto.edu; Sikan Li, Texas Advanced Computing Center, sli@tacc.utexas.edu; Zhao Zhang, Department of Electrical and Computer Engineering, Rutgers University, zhao.zhang@rutgers.edu; Maryam Mehri Dehnavi, Department of Computer Science, University of Toronto, mmehride@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: MKOR Algorithm for a Single Layer m (a sketch of the rank-1 inverse update at the core of the algorithm is given after the table)
Open Source Code | Yes | Our code base is publicly available at https://github.com/Mohammad-Mozaffari/mkor, and the instructions for running each experiment are available there.
Open Datasets | Yes | For the pre-training process, the English Wikipedia [30] and Toronto Book Corpus [34] datasets, which were used in the original BERT pre-training, are used; the latter dataset is not fully available, which results in a small reduction in the baseline accuracies achieved in our experiments relative to the original BERT results. Following [21], due to the time-intensive process of hyperparameter tuning for the first phase of pre-training, we report the effectiveness of MKOR in the second phase of pre-training only, using the checkpoints of the first phase generated with the LAMB optimizer.
Dataset Splits | Yes | AlexNet [12] with more than 20M parameters on CIFAR-100 [11], consisting of 50K training and 10K validation images across 100 classes (a data-loading sketch with this split is given after the table).
Hardware Specification | Yes | MKOR outperforms state-of-the-art first-order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57x and 1.85x respectively on BERT-Large-Uncased on 64 GPUs. (...) For the BERT-Large-Uncased pre-training and fine-tuning experiments, we have used up to 64 A100 GPUs on the Polaris [3] cluster, which has 560 nodes, each with 4 NVIDIA A100 GPUs with NVLink interconnects.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch version, TensorFlow version, CUDA version).
Experiment Setup | Yes | For the BERT-Large-Uncased pre-training, we use the same hyperparameters used in [18]. The factors in KAISA are updated every 50 iterations, and the factors in MKOR and MKOR-H are updated every 10 iterations. (...) For SGD and KAISA, we use the same hyperparameters used in [21]. The factors in MKOR are updated every 10 iterations, and the learning rate used there is the same as in KAISA. The learning rate in MKOR decays by a factor of 2 at the end of epochs 25, 35, 40, 45, 50, 55, and 56 (a sketch of this schedule is given after the table).
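
The Pseudocode row quotes Algorithm 1, which maintains the inverses of the Kronecker factors through rank-1 updates rather than periodic full inversions. The snippet below is a minimal sketch of that kind of update via the Sherman-Morrison identity; the exponential-moving-average form of the factor, the value of beta, and the function name are illustrative assumptions, not the paper's exact Algorithm 1 (which also covers damping, momentum, and distribution across workers).

```python
import torch

def rank1_inverse_update(A_inv: torch.Tensor, a: torch.Tensor, beta: float = 0.95) -> torch.Tensor:
    """Sherman-Morrison update of an inverse Kronecker factor.

    Keeps the inverse of A_new = beta * A + (1 - beta) * a a^T in sync with the
    rank-1 change to A, without re-inverting the full matrix. The EMA form and
    beta value are illustrative assumptions, not the paper's exact recipe.
    """
    a = a.reshape(-1, 1)                       # column vector
    c = (1.0 - beta) / beta                    # A_new = beta * (A + c * a a^T)
    Ainv_a = A_inv @ a                         # A^{-1} a
    denom = 1.0 + c * (a.T @ Ainv_a)           # 1 + c * a^T A^{-1} a  (scalar)
    inner_inv = A_inv - c * (Ainv_a @ Ainv_a.T) / denom   # (A + c a a^T)^{-1}
    return inner_inv / beta                    # (beta * (A + c a a^T))^{-1}

# Toy usage: track the inverse of an input-activation factor over 100 steps.
n = 4
A = torch.eye(n)        # reference factor, kept only to check the result
A_inv = torch.eye(n)    # its inverse, maintained with rank-1 updates
for _ in range(100):
    a = torch.randn(n)
    A = 0.95 * A + 0.05 * torch.outer(a, a)
    A_inv = rank1_inverse_update(A_inv, a, beta=0.95)

print(torch.dist(A_inv, torch.linalg.inv(A)))  # close to zero
```

Because each step touches only a rank-1 term, the update costs O(n^2) instead of the O(n^3) of a full inversion, which is the efficiency argument behind rank-1 factor updates.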
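
The Dataset Splits row refers to the standard CIFAR-100 split of 50K training and 10K validation images over 100 classes. A minimal torchvision loading sketch for that split is shown below; the batch size, worker count, and normalization statistics are common defaults and are assumptions, not values taken from the paper.

```python
import torch
from torchvision import datasets, transforms

# Standard CIFAR-100 split: 50,000 training and 10,000 validation/test images
# over 100 classes. Normalization statistics and batch size are common
# defaults, not values reported in the paper.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)),
])

train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
val_set = datasets.CIFAR100(root="./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=128, shuffle=False, num_workers=4)

print(len(train_set), len(val_set))  # 50000 10000
```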
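
The Experiment Setup row specifies that MKOR's learning rate is halved at the end of epochs 25, 35, 40, 45, 50, 55, and 56, with the Kronecker factors refreshed every 10 iterations. The sketch below expresses that schedule with torch.optim.lr_scheduler.MultiStepLR; the model, base learning rate, and iteration counts are placeholders (the actual runs use AlexNet on CIFAR-100 with the MKOR optimizer from the authors' repository), and only the milestones, decay factor, and factor-update frequency come from the quoted setup.

```python
import torch

# Stand-ins so the schedule can be exercised end to end; the real experiments
# train AlexNet on CIFAR-100 with the MKOR optimizer from the authors' repo.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Halve the learning rate at the end of epochs 25, 35, 40, 45, 50, 55, and 56,
# matching the decay schedule quoted in the Experiment Setup row.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[25, 35, 40, 45, 50, 55, 56], gamma=0.5
)

factor_update_freq = 10   # Kronecker factors are refreshed every 10 iterations
iters_per_epoch = 20      # illustrative value, not from the paper

for epoch in range(60):
    for it in range(iters_per_epoch):
        x, y = torch.randn(32, 10), torch.randn(32, 10)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if it % factor_update_freq == 0:
            pass  # with MKOR, the rank-1 factor/inverse updates happen here
    scheduler.step()       # decays lr when the epoch count hits a milestone

print(optimizer.param_groups[0]["lr"])  # 0.1 * 0.5**7 after all seven milestones
```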