4-bit Shampoo for Memory-Efficient Network Training
Authors: Sike Wang, Pan Zhou, Jia Li, Hua Huang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient. |
| Researcher Affiliation | Academia | Sike Wang, Beijing Normal University, sikewang@mail.bnu.edu.cn; Pan Zhou, Singapore Management University, panzhou@smu.edu.sg; Jia Li, Beijing Normal University, jiali@bnu.edu.cn; Hua Huang, Beijing Normal University, huahuang@bnu.edu.cn |
| Pseudocode | Yes | Algorithm 1 PU(λ, U, M) Input: singular value vector λ, quantized eigenvector matrix U, M, number of iterations t1 for rectification, exponential decay rate β ∈ (0, 1), Q and D. (An illustrative sketch of 4-bit quantization of such a matrix appears after the table.) |
| Open Source Code | Yes | Code is available at https://github.com/Sike-Wang/low-bit-Shampoo. |
| Open Datasets | Yes | We train VGG19 [36], ResNet34 [20], ViT-Small [10], and Swin-Tiny [28] on the CIFAR-100 [23] and Tiny-ImageNet [24] datasets with one RTX3060Ti GPU, and train ResNet50 and ViT-Base/32 on the ImageNet-1k dataset [34] with one A800 GPU. |
| Dataset Splits | No | The paper mentions using "validation loss" and "test accuracy" but does not explicitly provide the specific percentages or methodology for training, validation, and test dataset splits for all experiments. |
| Hardware Specification | Yes | We use one RTX3060Ti GPU under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the CIFAR-100 and Tiny-ImageNet datasets, use one A800 GPU under the PyTorch 2.0.1+CUDA11.7 framework for DNN training on the ImageNet-1k and C4 datasets, and use two NVIDIA L40S GPUs under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the OWT dataset. |
| Software Dependencies | Yes | We use one RTX3060Ti GPU under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the CIFAR-100 and Tiny-ImageNet datasets, use one A800 GPU under the PyTorch 2.0.1+CUDA11.7 framework for DNN training on the ImageNet-1k and C4 datasets, and use two NVIDIA L40S GPUs under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the OWT dataset. |
| Experiment Setup | Yes | For SGDM, we set the momentum to 0.9 and use an initial learning rate of 0.1. For Adagrad, we set ϵ = 10⁻¹⁰ and use an initial learning rate of 0.01. For AdamW, we set β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸ and use an initial learning rate of 0.001. (A runnable sketch of these baseline settings directly follows the table.) |
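
The Experiment Setup row lists the first-order baseline hyperparameters quoted from the paper. As a minimal sketch, those same settings expressed with the standard PyTorch optimizer constructors look as follows; the placeholder model is hypothetical, and any learning-rate schedule or weight decay reported elsewhere in the paper is omitted here.

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder model, standing in for VGG19/ResNet/ViT/Swin

# SGDM: momentum 0.9, initial learning rate 0.1
sgdm = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Adagrad: eps = 1e-10, initial learning rate 0.01
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01, eps=1e-10)

# AdamW: beta1 = 0.9, beta2 = 0.999, eps = 1e-8, initial learning rate 0.001
adamw = torch.optim.AdamW(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```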
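
The Pseudocode row refers to Algorithm 1 operating on a quantized eigenvector matrix. The paper's exact quantizer is not reproduced here; the sketch below is a generic block-wise linear 4-bit quantize/dequantize pair, included only to illustrate what storing a preconditioner's eigenvector matrix in 4 bits involves. The function names, the block size of 64, and the int8 storage (a real 4-bit implementation would pack two codes per byte) are all assumptions, not the authors' code.

```python
import torch

def quantize_4bit(x: torch.Tensor, block_size: int = 64):
    """Block-wise linear 4-bit quantization (illustrative sketch, not the paper's quantizer)."""
    flat = x.reshape(-1)
    pad = (-flat.numel()) % block_size                     # pad so the tensor splits into equal blocks
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    blocks = flat.reshape(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)   # one scale per block
    codes = torch.round(blocks / scales * 7).clamp(-8, 7).to(torch.int8)  # signed 4-bit range
    return codes, scales, x.shape, pad

def dequantize_4bit(codes, scales, shape, pad):
    """Reconstruct an approximate float matrix from 4-bit codes and per-block scales."""
    blocks = codes.float() / 7 * scales
    flat = blocks.reshape(-1)
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

# Example: compress the eigenvector matrix of a small symmetric preconditioner.
G = torch.randn(128, 128)
precond = G @ G.T
eigvals, eigvecs = torch.linalg.eigh(precond)
codes, scales, shape, pad = quantize_4bit(eigvecs)
eigvecs_hat = dequantize_4bit(codes, scales, shape, pad)
print((eigvecs - eigvecs_hat).abs().max())                 # reconstruction error of the 4-bit copy
```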