MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates
Authors: Mohammad Mozaffari, Sikan Li, Zhao Zhang, Maryam Mehri Dehnavi
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that MKOR outperforms state-of-the-art first-order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57× and 1.85× respectively on BERT-Large-Uncased on 64 GPUs. |
| Researcher Affiliation | Academia | Mohammad Mozaffari, Department of Computer Science, University of Toronto, mmozaffari@cs.toronto.edu; Sikan Li, Texas Advanced Computing Center, sli@tacc.utexas.edu; Zhao Zhang, Department of Electrical and Computer Engineering, Rutgers University, zhao.zhang@rutgers.edu; Maryam Mehri Dehnavi, Department of Computer Science, University of Toronto, mmehride@cs.toronto.edu |
| Pseudocode | Yes | Algorithm 1: MKOR Algorithm for a Single Layer m (an illustrative sketch of the rank-1 inverse update follows this table) |
| Open Source Code | Yes | Our code base is publicly available on https://github.com/Mohammad-Mozaffari/mkor, and the instructions for running each experiment are available there. |
| Open Datasets | Yes | For the pre-training process, the English Wikipedia [30] and the Toronto Book Corpus [34] dataset, which was used in the original BERT pre-training, are used; the latter dataset is not fully available, which results in a small reduction in the baseline accuracies achieved in our experiments from the original BERT results. Following [21], due to the time-intensive process of hyperparameter tuning for the first phase of pre-training, we report the effectiveness of MKOR in the second phase of pre-training only while using the checkpoints of the first phase generated using the LAMB optimizer. |
| Dataset Splits | Yes | AlexNet [12] with more than 20M parameters on CIFAR-100 [11], consisting of 50K training and 10K validation images of 100 classes. |
| Hardware Specification | Yes | MKOR outperforms state-of-the-art first-order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57× and 1.85× respectively on BERT-Large-Uncased on 64 GPUs. (...) For the BERT-Large-Uncased pre-training and fine-tuning experiments, we have used up to 64 A100 GPUs on the Polaris [3] cluster, which has 560 nodes, each with 4 NVIDIA A100 GPUs with NVLink interconnects. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as libraries or frameworks (e.g., PyTorch version, TensorFlow version, CUDA version). |
| Experiment Setup | Yes | For the BERT-Large-Uncased pre-training, we use the same hyperparameters used in [18]. The factors in KAISA are updated every 50 iterations, and the factors in MKOR and MKOR-H are updated every 10 iterations. (...) For SGD and KAISA, we use the same hyperparameters used in [21]. The factors in MKOR are updated every 10 iterations, and the learning rate used there is the same as in KAISA. The learning rate in MKOR decays by a factor of 2 at the end of epochs 25, 35, 40, 45, 50, 55, and 56. (An illustrative schedule sketch follows this table.) |
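
The Pseudocode row refers to Algorithm 1, which is not reproduced here. The operation named in the paper's title, a rank-1 update of the Kronecker-factor inverses, can nonetheless be illustrated with the Sherman-Morrison identity. The snippet below is a minimal sketch, not the authors' implementation: the function name `rank1_inverse_update`, the EMA decay `beta`, and the plain-PyTorch formulation are assumptions made for illustration only.

```python
import torch

def rank1_inverse_update(A_inv, u, beta=0.95):
    """Sherman-Morrison update of a maintained factor inverse.

    If a Kronecker factor is refreshed as A_new = beta * A + (1 - beta) * u u^T,
    then A_new^{-1} can be obtained from the stored A^{-1} in O(n^2) work
    instead of a full O(n^3) re-inversion.  The decay `beta` is an assumed
    placeholder, not a value taken from the paper.
    """
    u = u.reshape(-1, 1)                       # column vector, shape (n, 1)
    B_inv = A_inv / beta                       # (beta * A)^{-1}
    Bu = B_inv @ u                             # B^{-1} u
    denom = 1.0 + (1.0 - beta) * (u.T @ Bu)    # 1 + c * u^T B^{-1} u
    return B_inv - (1.0 - beta) * (Bu @ Bu.T) / denom

# Quick numerical check against a direct inverse (A = I, so A_inv = I here).
n = 8
A_inv = torch.eye(n, dtype=torch.float64)
u = torch.randn(n, dtype=torch.float64)
A_new = 0.95 * torch.eye(n, dtype=torch.float64) + 0.05 * torch.outer(u, u)
assert torch.allclose(rank1_inverse_update(A_inv, u),
                      torch.linalg.inv(A_new), atol=1e-10)
```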
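
The learning-rate schedule reported in the Experiment Setup row (halving at the end of epochs 25, 35, 40, 45, 50, 55, and 56) maps naturally onto a step scheduler. The sketch below assumes PyTorch's `MultiStepLR`; the model, base learning rate, momentum, and epoch count are placeholder values, not the authors' settings, since the paper reuses KAISA's learning rate for these experiments.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model and hyperparameters; lr=0.1 and momentum=0.9 are assumed
# example values only, as the paper takes its learning rate from KAISA.
model = torch.nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# "Decays by a factor of 2 at the end of epochs 25, 35, 40, 45, 50, 55, and 56"
# translates to gamma=0.5 with those epochs as milestones.
scheduler = MultiStepLR(optimizer,
                        milestones=[25, 35, 40, 45, 50, 55, 56],
                        gamma=0.5)

for epoch in range(1, 61):          # 60 epochs is an assumed total
    # ... one training epoch would run here ...
    scheduler.step()                # stepping at epoch end applies the decay
    if epoch in {25, 35, 40, 45, 50, 55, 56}:
        print(f"end of epoch {epoch}: lr = {scheduler.get_last_lr()[0]:.5f}")
```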