Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation

Authors: Lin Zhang, Shaohuai Shi, Bo Li

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experimental results on different models and datasets show that Eva reduces the end-to-end training time up to 2.05× and 2.42× compared to first-order SGD and second-order algorithms (K-FAC and Shampoo), respectively.
Researcher Affiliation | Academia | Hong Kong University of Science and Technology; Harbin Institute of Technology, Shenzhen
Pseudocode | No | The paper describes the algorithm using mathematical derivations and textual explanations but does not include a formal pseudocode block or algorithm box.
Open Source Code | Yes | We implement our algorithm atop the PyTorch framework and provide easy-to-use APIs so that users can adopt it by adding several lines of code in their training scripts. The code is available at https://github.com/lzhangbv/eva. (An illustrative integration sketch follows the table.)
Open Datasets | Yes | We conduct our experiments on three commonly used datasets: Cifar-10 (Krizhevsky, 2009), Cifar-100 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009).
Dataset Splits | Yes | Cifar-10 and Cifar-100 each have 50,000 training images and 10,000 validation images; ImageNet has 1.3M training images and 50,000 validation images. (A loading sketch follows the table.)
Hardware Specification | Yes | We conduct our experiments on a 32-GPU cluster. It consists of 8 nodes connected by 10Gb/s Ethernet, and each node has 4 Nvidia RTX 2080Ti GPUs, two Intel(R) Xeon(R) Gold 6230 CPUs, 512GB of memory, and a PCIe 3.0 x16 interconnect.
Software Dependencies | Yes | We use some common software including PyTorch-1.10.0, Horovod-0.21.0, CUDA-10.2, cuDNN-7.6, and NCCL-2.6.4.
Experiment Setup | Yes | We set the same hyper-parameters for all algorithms for a fair comparison, and the details are given in Appendix C.1. ... In training VGG-19, ResNet-110, and WRN-28-10 on Cifar-10 and Cifar-100 with SGD, K-FAC, and Eva, following (Pauloski et al., 2020), we set the mini-batch size to 512, the learning rate to 0.4, and the weight decay to 5e-4. We apply a multi-step learning rate schedule (linear warmup over the first 5 epochs, then decay by a factor of 10 at 35%, 75%, and 90% of the epochs). For K-FAC and Eva, we set the damping to 0.03, the running average to 0.95, and the KL-clip to 0.001. The second-order update interval of K-FAC is 10. (A configuration sketch based on these values follows the table.)
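
To illustrate the "several lines of code" claim for the open-source implementation, the sketch below shows how a second-order preconditioner of this kind is typically wired into a PyTorch training loop. The preconditioner class name and its constructor arguments are hypothetical placeholders, not the actual API of https://github.com/lzhangbv/eva; consult the repository for the real interface.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy model and random data stand in for the paper's CNNs and Cifar/ImageNet loaders.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.4, momentum=0.9, weight_decay=5e-4)

# A second-order preconditioner would be constructed once; the name and arguments
# below are hypothetical, not the repository's actual API:
# preconditioner = EvaPreconditioner(model, damping=0.03, factor_decay=0.95, kl_clip=0.001)

for step in range(3):
    inputs = torch.randn(8, 32)
    targets = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    # preconditioner.step()  # precondition the gradients before the SGD update
    optimizer.step()
```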
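The reported Cifar splits can be checked against the standard torchvision datasets; the sketch below assumes a plain ToTensor transform and the usual convention of using the test split as the validation set, which may differ from the paper's augmentation pipeline.

```python
import torchvision
import torchvision.transforms as transforms

# The torchvision splits match the reported numbers: train=True gives 50,000 images
# and train=False gives the 10,000 images used as the validation set.
transform = transforms.ToTensor()  # transform choice is an assumption
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
val_set = torchvision.datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
print(len(train_set), len(val_set))  # 50000 10000

# ImageNet (roughly 1.3M training and 50,000 validation images) cannot be downloaded
# via torchvision and is usually loaded with torchvision.datasets.ImageFolder on a local copy.
```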
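Finally, a minimal sketch of how the quoted SGD hyper-parameters map onto PyTorch's optimizer and a warmup-plus-multi-step schedule. The total epoch budget (100) and the momentum value (0.9) are assumptions not stated in the quoted excerpt; the K-FAC/Eva-specific settings (damping 0.03, running average 0.95, KL-clip 0.001, update interval 10) would be passed to the second-order preconditioner rather than to SGD.

```python
import torch.nn as nn
import torch.optim as optim

# Assumed epoch budget for Cifar training; the exact value is given in the paper's Appendix C.1.
epochs = 100
warmup_epochs = 5
milestones = [int(epochs * f) for f in (0.35, 0.75, 0.90)]  # decay at 35%, 75%, 90% of the epochs

model = nn.Linear(32, 10)  # placeholder for VGG-19 / ResNet-110 / WRN-28-10
# lr and weight decay follow the quoted setup; momentum=0.9 is an assumption.
optimizer = optim.SGD(model.parameters(), lr=0.4, momentum=0.9, weight_decay=5e-4)

def lr_lambda(epoch):
    # Linear warmup over the first 5 epochs, then decay by a factor of 10 at each milestone.
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return 0.1 ** sum(epoch >= m for m in milestones)

scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(epochs):
    # ... one epoch of training with mini-batch size 512 goes here ...
    scheduler.step()
```

LambdaLR is used here only to express the warmup and the step decays in one place; an equivalent combination of a warmup scheduler and MultiStepLR would work as well.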