Inefficiency of K-FAC for Large Batch Size Training

Authors: Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael Mahoney (pp. 5053-5060)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform an extensive analysis of large batch size training for two popular methods, Stochastic Gradient Descent (SGD) and the Kronecker-Factored Approximate Curvature (K-FAC) method. We evaluate the performance of these methods in terms of both wall-clock time and aggregate computational cost, and study hyper-parameter sensitivity by performing more than 512 experiments per batch size for each method. We perform experiments on multiple models on two datasets, CIFAR-10 and SVHN. The results show that beyond a critical batch size both K-FAC and SGD significantly deviate from ideal strong scaling behavior, and that, despite common belief, K-FAC does not exhibit improved large-batch scalability compared to SGD. (A short illustration of ideal strong scaling is sketched after the table.)
Researcher Affiliation | Academia | University of California at Berkeley, Berkeley, USA {linjian, gabe montague, yejiayu, zheweiy, amirgh, keutzer, mahoneym}@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | To answer these questions, we conduct a comprehensive investigation in the context of image classification on CIFAR-10 (Krizhevsky and Hinton 2009) and SVHN (Netzer et al. 2011).
Dataset Splits | No | The paper mentions training and testing but does not explicitly describe the dataset splits (e.g., percentages or counts for training, validation, and test sets) or provide specific access information for pre-defined splits.
Hardware Specification | No | The paper mentions "1024 V100 GPU machines of ABCI supercomputer" in the context of *prior work* (Osawa et al. 2018) but does not specify the hardware used for its *own* experiments.
Software Dependencies | No | The paper mentions "PyTorch" once without a version number, and does not list any other specific software dependencies with their version numbers.
Experiment Setup | Yes | For experiments on CIFAR-10, we decay the learning rate twice by a factor of ten... for large batch runs we allow a proportionally greater number of epochs to pass before learning rate decay... for CIFAR-10, the number of training epochs equals (log2(batch size / 128) + 1) * 100; for SVHN, it equals (log2(batch size / 128) + 1) * 20... For K-FAC, we conduct a log-space grid search over 64 configurations, with learning rates ranging from 10^-3 to 2.187 and with damping ranging from 10^-4 to 0.2187... The decay rate for second-order statistics is held constant at 0.9 throughout training. We use update clipping as in (Ba, Grosse, and Martens 2017), with a constant parameter of 0.1. To ensure a fair comparison between methods, we employ a similarly extensive hyperparameter tuning process for SGD... learning rates range from 0.05 to 9.62, and momentum ranges from 0.9 to 0.999.
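
The epoch budget and grid-search description in the Experiment Setup row can be made concrete with a short sketch. The snippet below is a minimal illustration, not code from the paper: the factor-of-3 spacing of the K-FAC grids is inferred from the reported endpoints (10^-3 * 3^7 = 2.187 and 10^-4 * 3^7 = 0.2187), and the helper names are ours.

```python
import math

def epoch_budget(batch_size, dataset="cifar10", base_batch=128):
    """Training-epoch budget described in the setup: larger batches are
    given proportionally more epochs before the learning rate decays."""
    base_epochs = {"cifar10": 100, "svhn": 20}[dataset]
    return int((math.log2(batch_size / base_batch) + 1) * base_epochs)

def log_grid(low, high, num):
    """`num` log-spaced values from `low` to `high`, inclusive."""
    ratio = (high / low) ** (1.0 / (num - 1))
    return [low * ratio ** i for i in range(num)]

# K-FAC search space: 8 learning rates x 8 damping values = 64 configs.
# The factor-of-3 spacing is an inference from the reported endpoints,
# not something the paper states explicitly.
kfac_learning_rates = log_grid(1e-3, 2.187, 8)   # 0.001, 0.003, ..., 2.187
kfac_dampings = log_grid(1e-4, 0.2187, 8)        # 0.0001, 0.0003, ..., 0.2187
kfac_configs = [(lr, d) for lr in kfac_learning_rates for d in kfac_dampings]

if __name__ == "__main__":
    for bs in (128, 256, 512, 1024):
        print(bs, epoch_budget(bs, "cifar10"), epoch_budget(bs, "svhn"))
    print(len(kfac_configs))  # 64
```

For example, batch size 1024 on CIFAR-10 gets (log2(1024/128) + 1) * 100 = 400 epochs under this schedule, versus 100 epochs at the base batch size of 128.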
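
For context on the "ideal strong scaling" claim quoted in the Research Type row: under ideal strong scaling, the number of iterations needed to reach a fixed accuracy falls in proportion to the batch-size increase, so aggregate computation stays constant while wall-clock time shrinks. The numbers below are purely illustrative, not results from the paper.

```python
def ideal_iterations(base_iters, base_batch, batch):
    """Iterations to a fixed accuracy under ideal strong scaling:
    doubling the batch size should halve the iteration count, keeping
    the total number of examples processed (and compute) constant."""
    return base_iters * base_batch / batch

# Hypothetical baseline: 10,000 iterations at batch size 128.
for bs in (128, 256, 512, 1024, 2048):
    print(f"batch {bs}: ideal iterations = {ideal_iterations(10_000, 128, bs):.0f}")

# The paper's finding is that past a critical batch size, neither SGD
# nor K-FAC tracks this ideal curve: iterations stop shrinking
# proportionally, so aggregate computational cost grows.
```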