Error Feedback Can Accurately Compress Preconditioners
Authors: Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We now validate our results experimentally. In the main text, we focus on comparing Sparse M-FAC with other optimizers, as it proved to be the most practical variant of EFCP. We show results for low-rank M-FAC and Sparse GGT in the Appendix. In the following, we validate Sparse M-FAC (S-MFAC) experimentally as an optimizer for training in the context of standard vision and language modelling tasks. Specifically, we integrate our S-MFAC optimizer in ASDL (Osawa et al., 2023), a benchmarking library for second-order optimizers, but also examine larger-scale tasks such as ImageNet training of ResNet-18 and compare S-MFAC with AdamW and Dense M-FAC (D-MFAC) on BERT models on GLUE tasks. We report training loss, test accuracy, and total running times and top memory usage for the entire training process. |
| Researcher Affiliation | Academia | Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh (Institute of Science and Technology Austria). Correspondence to: Ionut-Vlad Modoranu <ionut-vlad.modoranu@ist.ac.at>. |
| Pseudocode | Yes | Algorithm 1 EFCP: Error Feedback for Accurate Compressed Full-Matrix Preconditioning and Algorithm 2 SP kernel (CUDA) and Algorithm 3 LCG kernel (CUDA) and Algorithm 4 Detailed Sparse GGT Implementation. |
| Open Source Code | Yes | Our code is available on our GitHub repository https://github.com/IST-DASLab/EFCP/. |
| Open Datasets | Yes | We validate our implementation experimentally on standard vision (ResNet/ImageNet) and language modeling (BERT/GLUE) tasks. ... ImageNet/ResNet-18. Next, we move to a more complex vision task, by training ResNet-18 on ImageNet (Deng et al., 2009)... BERT/GLUE-MNLI. Finally, we test our S-MFAC implementation for BERT-TINY/MINI/BASE models on the MNLI task from the GLUE benchmark (Wang et al., 2019a)... ResNet-20/CIFAR-10 (Krizhevsky & Hinton, 2009). |
| Dataset Splits | No | The paper mentions training on datasets like ImageNet and GLUE and reports 'validation accuracy' and 'test accuracy', implying the use of validation and test sets. However, it does not explicitly provide specific details on the percentage or number of samples used for training, validation, and test splits (e.g., '80/10/10 split' or 'X samples for training'). While standard splits are implied for these well-known datasets, the paper does not concretely describe them. |
| Hardware Specification | Yes | To examine memory, we ran the experiments on a single A100 GPU and recorded the max memory usage throughout the entire process via the NVIDIA-SMI tool. and Table 3. Running times and memory usages for ImageNet/ResNet-18 and GLUE/BERT reported on an NVIDIA A100 GPU with 82GB RAM. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and refers to the 'PyTorch Sparse library (Fey et al., 2018)'. However, it does not specify exact version numbers for these software components (e.g., 'PyTorch 1.9'). |
| Experiment Setup | Yes | Notations. We use the notation E for number of epochs, η for learning rate, γ for weight decay, B for batch size, m for the number of gradients (sliding window size), k = 1% for the gradient density. ... All hyper-parameters are provided in the Appendix E. and Table 9. Hyper-parameters for SGD and S/D-MFAC for FFCV/Image Net. |
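The pseudocode cell above names Algorithm 1 (EFCP: error feedback for compressed preconditioning), and the setup cell notes a gradient density of k = 1%. The core error-feedback-with-top-k idea those refer to can be illustrated with a minimal NumPy sketch. This is an illustrative sketch of generic top-k error feedback, not the authors' CUDA kernels or their M-FAC integration; the function names `topk_mask` and `ef_compress` are invented here.

```python
import numpy as np

def topk_mask(x, k_frac):
    """Keep the k_frac fraction of entries with largest magnitude; zero the rest."""
    k = max(1, int(k_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    return np.where(np.abs(x) >= thresh, x, 0.0)

def ef_compress(grad, error, k_frac=0.01):
    """One error-feedback step: accumulate the residual, compress, retain the remainder.

    The dropped mass (new_error) is added back to the next gradient, so
    compression error accumulates in the buffer instead of being lost.
    """
    acc = error + grad                 # fold in the residual from previous steps
    sparse = topk_mask(acc, k_frac)    # top-k sparsification (k = 1% density in the paper)
    new_error = acc - sparse           # everything dropped is fed back next step
    return sparse, new_error
```

The invariant `sparse + new_error == grad + error` is what makes the compression lossless in the long run: nothing is discarded, only deferred.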