4-bit Shampoo for Memory-Efficient Network Training

Authors: Sike Wang, Pan Zhou, Jia Li, Hua Huang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient.
Researcher Affiliation | Academia | Sike Wang, Beijing Normal University, sikewang@mail.bnu.edu.cn; Pan Zhou, Singapore Management University, panzhou@smu.edu.sg; Jia Li, Beijing Normal University, jiali@bnu.edu.cn; Hua Huang, Beijing Normal University, huahuang@bnu.edu.cn
Pseudocode | Yes | Algorithm 1 PU(λ, U, M). Input: singular value vector λ, quantized eigenvector matrix U, M, number of iterations t1 for rectification, exponential decay rate β ∈ (0, 1), Q and D. (See the quantization sketch after this table.)
Open Source Code | Yes | Code is available at https://github.com/Sike-Wang/low-bit-Shampoo.
Open Datasets | Yes | We train VGG19 [36], ResNet34 [20], ViT-Small [10], and Swin-Tiny [28] on the CIFAR-100 [23] and Tiny-ImageNet [24] datasets with one RTX3060Ti GPU, and train ResNet50 and ViT-Base/32 on the ImageNet-1k dataset [34] with one A800 GPU.
Dataset Splits | No | The paper mentions "validation loss" and "test accuracy" but does not explicitly specify the training/validation/test split percentages or methodology for all experiments.
Hardware Specification | Yes | We use one RTX3060Ti GPU under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the CIFAR-100 and Tiny-ImageNet datasets, use one A800 GPU under the PyTorch 2.0.1+CUDA11.7 framework for DNN training on the ImageNet-1k and C4 datasets, and use two NVIDIA L40S GPUs under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the OWT dataset.
Software Dependencies | Yes | We use one RTX3060Ti GPU under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the CIFAR-100 and Tiny-ImageNet datasets, use one A800 GPU under the PyTorch 2.0.1+CUDA11.7 framework for DNN training on the ImageNet-1k and C4 datasets, and use two NVIDIA L40S GPUs under the PyTorch 2.0.1+CUDA11.8 framework for DNN training on the OWT dataset.
Experiment Setup | Yes | For SGDM, we set the momentum to 0.9 and use an initial learning rate of 0.1. For Adagrad, we set ϵ = 10⁻¹⁰ and use an initial learning rate of 0.01. For AdamW, we set β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸ and use an initial learning rate of 0.001. (See the optimizer sketch below.)
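
The Pseudocode row quotes Algorithm 1, whose inputs include a quantized eigenvector matrix and the operators Q and D. The sketch below is only a minimal illustration of block-wise 4-bit linear quantization of such a matrix; the block size, the uniform [-7, 7] code mapping, and the function names are assumptions for illustration, not the paper's exact quantizer or rectification procedure.

```python
# Minimal sketch: block-wise 4-bit linear quantization of a preconditioner
# eigenvector matrix, illustrating the Q (quantize) / D (dequantize) idea.
# Block size, the uniform [-7, 7] code mapping, and the function names are
# illustrative assumptions, not the paper's exact quantizer.
import torch

def quantize_4bit(U: torch.Tensor, block_size: int = 64):
    """Return 4-bit codes (stored in int8 here) plus one fp32 scale per block."""
    flat = U.reshape(-1)
    pad = (-flat.numel()) % block_size
    flat = torch.cat([flat, flat.new_zeros(pad)])  # pad to a whole number of blocks
    blocks = flat.view(-1, block_size)
    scale = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    codes = (blocks / scale * 7).round().clamp(-7, 7).to(torch.int8)
    return codes, scale, U.shape, pad

def dequantize_4bit(codes, scale, shape, pad):
    """Map the codes back to an fp32 approximation of the original matrix."""
    flat = (codes.to(torch.float32) / 7 * scale).reshape(-1)
    if pad:
        flat = flat[:-pad]
    return flat.view(shape)

# Round-trip an orthogonal matrix and check the reconstruction error.
U, _ = torch.linalg.qr(torch.randn(256, 256))
codes, scale, shape, pad = quantize_4bit(U)
U_hat = dequantize_4bit(codes, scale, shape, pad)
print((U - U_hat).abs().max())
```

In real 4-bit storage two codes would be packed per byte; int8 is used here only to keep the sketch short.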
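
The Experiment Setup row quotes the first-order optimizer hyperparameters. A minimal sketch of those settings as standard PyTorch constructors follows; the ResNet34/CIFAR-100 model is an illustrative assumption, and how these optimizers are combined with (4-bit) Shampoo preconditioning is not shown.

```python
# Baseline optimizer settings from the Experiment Setup row, written as
# standard PyTorch constructors. The model choice is illustrative only.
import torch
import torchvision

model = torchvision.models.resnet34(num_classes=100)

sgdm = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
adagrad = torch.optim.Adagrad(model.parameters(), lr=0.01, eps=1e-10)
adamw = torch.optim.AdamW(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8)
```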