4-bit Shampoo for Memory-Efficient Network Training

Authors: Sike Wang, Pan Zhou, Jia Li, Hua Huang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient.
Researcher Affiliation | Academia | Sike Wang, Beijing Normal University, sikewang@mail.bnu.edu.cn; Pan Zhou, Singapore Management University, panzhou@smu.edu.sg; Jia Li, Beijing Normal University, jiali@bnu.edu.cn; Hua Huang, Beijing Normal University, huahuang@bnu.edu.cn
Pseudocode | Yes | Algorithm 1 PU(λ, U, M). Input: singular value vector λ, quantized eigenvector matrix U, M, number of iterations t1 for rectification, exponential decay rate β ∈ (0, 1), Q and D. (See the quantization sketch after this table.)
Open Source Code | Yes | Code is available at https://github.com/Sike-Wang/low-bit-Shampoo.
Open Datasets | Yes | We train VGG19 [36], ResNet34 [20], ViT-Small [10], and Swin-Tiny [28] on the CIFAR-100 [23] and Tiny-ImageNet [24] datasets with one RTX3060Ti GPU, and train ResNet50 and ViT-Base/32 on the ImageNet-1k dataset [34] with one A800 GPU.
Dataset Splits | No | The paper mentions "validation loss" and "test accuracy" but does not explicitly specify the training/validation/test split percentages or methodology for all experiments.
Hardware Specification | Yes | We use one RTX3060Ti GPU under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the CIFAR-100 and Tiny-ImageNet datasets, use one A800 GPU under the PyTorch 2.0.1+CUDA11.7 framework for DNN training on the ImageNet-1k and C4 datasets, and use two NVIDIA L40S GPUs under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the OWT dataset.
Software Dependencies | Yes | We use one RTX3060Ti GPU under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the CIFAR-100 and Tiny-ImageNet datasets, use one A800 GPU under the PyTorch 2.0.1+CUDA11.7 framework for DNN training on the ImageNet-1k and C4 datasets, and use two NVIDIA L40S GPUs under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the OWT dataset.
Experiment Setup | Yes | For SGDM, we set the momentum to 0.9 and use an initial learning rate of 0.1. For Adagrad, we set ϵ = 10⁻¹⁰ and use an initial learning rate of 0.01. For AdamW, we set β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸ and use an initial learning rate of 0.001. (See the optimizer sketch below.)
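
The Pseudocode row quotes Algorithm 1, whose inputs include a quantized eigenvector matrix and the operators Q and D. The sketch below is only a minimal illustration of block-wise 4-bit linear quantization of such a matrix; the block size, the uniform [-7, 7] code mapping, and the function names are assumptions for illustration, not the paper's exact quantizer or rectification procedure.

```python
# Minimal sketch: block-wise 4-bit linear quantization of a preconditioner
# eigenvector matrix, illustrating the Q (quantize) / D (dequantize) idea.
# Block size, the uniform [-7, 7] code mapping, and the function names are
# illustrative assumptions, not the paper's exact quantizer.
import torch

def quantize_4bit(U: torch.Tensor, block_size: int = 64):
    """Return 4-bit codes (stored in int8 here) plus one fp32 scale per block."""
    flat = U.reshape(-1)
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])  # pad to a whole number of blocks
    blocks = flat.view(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    codes = (blocks / scale * 7).round().clamp(-7, 7).to(torch.int8)
    return codes, scale, U.shape, pad

def dequantize_4bit(codes, scale, shape, pad):
    """Map the codes back to an fp32 approximation of the original matrix."""
    flat = (codes.to(torch.float32) / 7 * scale).reshape(-1)
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

# Round-trip an orthogonal matrix and check the reconstruction error.
U, _ = torch.linalg.qr(torch.randn(256, 256))
codes, scale, shape, pad = quantize_4bit(U)
U_hat = dequantize_4bit(codes, scale, shape, pad)
print((U - U_hat).abs().max())
```

In real 4-bit storage two codes would be packed per byte; int8 is used here only to keep the sketch short.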
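
The Experiment Setup row quotes the first-order optimizer hyperparameters. A minimal sketch of those settings as standard PyTorch constructors follows; the ResNet34/CIFAR-100 model is an illustrative assumption, and how these optimizers are combined with (4-bit) Shampoo preconditioning is not shown.

```python
# Baseline optimizer settings from the Experiment Setup row, written as
# standard PyTorch constructors. The model choice is illustrative only.
import torch
import torchvision

model = torchvision.models.resnet34(num_classes=100)

sgdm = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01, eps=1e-10)
adamw = torch.optim.AdamW(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```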