Inefficiency of K-FAC for Large Batch Size Training

Authors: Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney (pp. 5053-5060)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an extensive analysis of large batch size training for two popular methods, Stochastic Gradient Descent (SGD) and the Kronecker-Factored Approximate Curvature (K-FAC) method. We evaluate the performance of these methods in terms of both wall-clock time and aggregate computational cost, and study hyper-parameter sensitivity by performing more than 512 experiments per batch size for each method. We perform experiments on multiple models on two datasets, CIFAR-10 and SVHN. The results show that beyond a critical batch size both K-FAC and SGD significantly deviate from ideal strong scaling behavior, and that, despite common belief, K-FAC does not exhibit improved large-batch scalability compared to SGD. (A short illustration of ideal strong scaling is sketched after the table.)
Researcher Affiliation | Academia | University of California at Berkeley, Berkeley, USA {linjian, gabe montague, yejiayu, zheweiy, amirgh, keutzer, mahoneym}@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | To answer these questions, we conduct a comprehensive investigation in the context of image classification on CIFAR-10 (Krizhevsky and Hinton 2009) and SVHN (Netzer et al. 2011).
Dataset Splits | No | The paper mentions training and testing but does not explicitly describe the dataset splits (e.g., percentages or counts for training, validation, and test sets) or provide specific access information for pre-defined splits.
Hardware Specification | No | The paper mentions "1024 V100 GPU machines of ABCI supercomputer" in the context of *prior work* (Osawa et al. 2018) but does not specify the hardware used for its *own* experiments.
Software Dependencies | No | The paper mentions "PyTorch" once without a version number, and does not list any other specific software dependencies with their version numbers.
Experiment Setup | Yes | For experiments on CIFAR-10, we decay the learning rate twice by a factor of ten... for large batch runs we allow a proportionally greater number of epochs to pass before learning rate decay... for CIFAR-10, the number of training epochs equals (log2(batch size / 128) + 1) * 100; for SVHN, it equals (log2(batch size / 128) + 1) * 20... For K-FAC, we conduct a log-space grid search over 64 configurations, with learning rates ranging from 10^-3 to 2.187 and with damping ranging from 10^-4 to 0.2187... The decay rate for second-order statistics is held constant at 0.9 throughout training. We use update clipping as in (Ba, Grosse, and Martens 2017), with a constant parameter of 0.1. To ensure a fair comparison between methods, we employ a similarly extensive hyperparameter tuning process for SGD... learning rates range from 0.05 to 9.62, and momentum ranges from 0.9 to 0.999.
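
The epoch budget and grid-search description in the Experiment Setup row can be made concrete with a short sketch. The snippet below is a minimal illustration, not code from the paper: the factor-of-3 spacing of the K-FAC grids is inferred from the reported endpoints (10^-3 * 3^7 = 2.187 and 10^-4 * 3^7 = 0.2187), and the helper names are ours.

```python
import math

def epoch_budget(batch_size, dataset="cifar10", base_batch=128):
    """Training-epoch budget described in the setup: larger batches are
    given proportionally more epochs before the learning rate decays."""
    base_epochs = {"cifar10": 100, "svhn": 20}[dataset]
    return int((math.log2(batch_size / base_batch) + 1) * base_epochs)

def log_grid(low, high, num):
    """`num` log-spaced values from `low` to `high`, inclusive."""
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]

# K-FAC search space: 8 learning rates x 8 damping values = 64 configs.
# The factor-of-3 spacing is an inference from the reported endpoints,
# not something the paper states explicitly.
kfac_learning_rates = log_grid(1e-3, 2.187, 8)   # 0.001, 0.003, ..., 2.187
kfac_dampings = log_grid(1e-4, 0.2187, 8)        # 0.0001, 0.0003, ..., 0.2187
kfac_configs = [(lr, d) for lr in kfac_learning_rates for d in kfac_dampings]

if __name__ == "__main__":
    for bs in (128, 256, 512, 1024):
        print(bs, epoch_budget(bs, "cifar10"), epoch_budget(bs, "svhn"))
    print(len(kfac_configs))  # 64
```

For example, batch size 1024 on CIFAR-10 gets (log2(1024/128) + 1) * 100 = 400 epochs under this schedule, versus 100 epochs at the base batch size of 128.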
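
For context on the "ideal strong scaling" claim quoted in the Research Type row: under ideal strong scaling, the number of iterations needed to reach a fixed accuracy falls in proportion to the batch-size increase, so aggregate computation stays constant while wall-clock time shrinks. The numbers below are purely illustrative, not results from the paper.

```python
def ideal_iterations(base_iters, base_batch, batch):
    """Iterations to a fixed accuracy under ideal strong scaling:
    doubling the batch size should halve the iteration count, keeping
    the total number of examples processed (and compute) constant."""
    return base_iters * base_batch / batch

# Hypothetical baseline: 10,000 iterations at batch size 128.
for bs in (128, 256, 512, 1024, 2048):
    print(f"batch {bs}: ideal iterations = {ideal_iterations(10_000, 128, bs):.0f}")

# The paper's finding is that past a critical batch size, neither SGD
# nor K-FAC tracks this ideal curve: iterations stop shrinking
# proportionally, so aggregate computational cost grows.
```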