Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation
Authors: Lin Zhang, Shaohuai Shi, Bo Li
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results on different models and datasets show that Eva reduces the end-to-end training time by up to 2.05× and 2.42× compared to first-order SGD and second-order algorithms (K-FAC and Shampoo), respectively. |
| Researcher Affiliation | Academia | Hong Kong University of Science and Technology; Harbin Institute of Technology, Shenzhen |
| Pseudocode | No | The paper describes the algorithm using mathematical derivations and textual explanations but does not include a formal pseudocode block or algorithm box. |
| Open Source Code | Yes | We implement our algorithm atop the PyTorch framework and provide easy-to-use APIs so that users can adopt it by adding several lines of code in their training scripts. The code is available at https://github.com/lzhangbv/eva. (A hypothetical integration sketch is shown below the table.) |
| Open Datasets | Yes | We conduct our experiments on three commonly used datasets: Cifar-10 (Krizhevsky, 2009), Cifar-100 (Krizhevsky, 2009), and ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | Cifar-10/100 has 50,000 training images and 10,000 validation images. ImageNet has 1.3M training images and 50,000 validation images. |
| Hardware Specification | Yes | We conduct our experiments on a 32-GPU cluster. It consists of 8 nodes connected by 10Gb/s Ethernet; each node has 4 Nvidia RTX2080Ti GPUs (PCIe3.0 x16), two Intel(R) Xeon(R) Gold 6230 CPUs, and 512GB memory. |
| Software Dependencies | Yes | We use some common software including PyTorch-1.10.0, Horovod-0.21.0, CUDA-10.2, cuDNN-7.6, and NCCL-2.6.4. |
| Experiment Setup | Yes | We set the same hyper-parameters for all algorithms for a fair comparison, and the details are given in Appendix C.1. ... In training VGG-19, ResNet-110, WRN-28-10 on Cifar-10 and Cifar-100 with SGD, K-FAC, and Eva, following (Pauloski et al., 2020), we set the mini-batch size to 512, learning rate to 0.4, and weight decay to 5e-4. We apply the multi-step learning rate schedule (a linear warmup at the first 5 epochs and learning rate decays by a factor of 10 at 35%, 75%, and 90% of the epochs). For K-FAC and Eva, we set damping to 0.03, running average to 0.95, and KL-clip to 0.001. The second-order update interval of K-FAC is 10. (The learning-rate policy is sketched below the table.) |
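
The "several lines of code" integration quoted in the Open Source Code row follows PyTorch's usual train-loop pattern. The sketch below is hypothetical: the class name `EvaPreconditioner` and its constructor signature are assumptions for illustration only (the damping, running-average, and KL-clip values are the ones quoted above); consult https://github.com/lzhangbv/eva for the actual API.

```python
import torch
import torch.nn.functional as F

# Standard PyTorch setup; the SGD hyper-parameters match those quoted above.
model = torch.nn.Linear(32, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, weight_decay=5e-4)

# Hypothetical wrapper name and signature -- not taken from the repository:
# preconditioner = EvaPreconditioner(model, damping=0.03, factor_decay=0.95, kl_clip=0.001)

x = torch.randn(512, 32)            # dummy mini-batch (batch size 512, as in the setup above)
y = torch.randint(0, 10, (512,))

optimizer.zero_grad()
loss = F.cross_entropy(model(x), y)
loss.backward()
# preconditioner.step()             # hypothetical: precondition gradients before the SGD update
optimizer.step()
```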
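
For the Experiment Setup row, the stated learning-rate policy (base LR 0.4, linear warmup over the first 5 epochs, then decay by 10x at 35%, 75%, and 90% of training) can be expressed with PyTorch's `LambdaLR`. This is a sketch under stated assumptions, not code from the paper: the total epoch count is a placeholder (the quote gives decay points only as percentages), scheduling at epoch granularity is assumed, and momentum is omitted because the quote does not state it.

```python
import torch

EPOCHS = 100                                    # assumed total; the quote gives only percentages
WARMUP_EPOCHS = 5                               # linear warmup over the first 5 epochs
MILESTONES = [int(EPOCHS * p) for p in (0.35, 0.75, 0.90)]

model = torch.nn.Linear(32, 10)                 # stand-in for VGG-19 / ResNet-110 / WRN-28-10
optimizer = torch.optim.SGD(model.parameters(), lr=0.4, weight_decay=5e-4)

def lr_factor(epoch: int) -> float:
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS      # linear warmup toward the base LR of 0.4
    return 0.1 ** sum(epoch >= m for m in MILESTONES)  # decay by 10x at each milestone

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

for epoch in range(EPOCHS):
    # ... one training epoch with mini-batch size 512 goes here ...
    scheduler.step()
```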