Memory Efficient Optimizers with 4-bit States

Authors: Bingrui Li, Jianfei Chen, Jun Zhu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all the tasks our optimizers can achieve comparable accuracy with their full-precision counterparts, while enjoying better memory efficiency.
Researcher Affiliation | Collaboration | Bingrui Li, Jianfei Chen, Jun Zhu, all with the Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University.
Pseudocode | Yes | Algorithm 1: Compression-based Memory Efficient Optimization Framework; Algorithm 2: Compression-based Memory Efficient SGDM; Algorithm 3: Compression-based Memory Efficient Adam; Algorithm 4: Rank-1 Normalization (an illustrative sketch of the framework appears after the table).
Open Source Code | Yes | Code is available at https://github.com/thu-ml/low-bit-optimizers
Open Datasets | Yes | We report performance metrics on standard benchmarks, including image classification (CLS) with Swin-T [31] on ImageNet-1k [12], natural language understanding (NLU) by fine-tuning RoBERTa-L [30] on GLUE [51], question answering (QA) by fine-tuning RoBERTa-L on SQuAD [42, 43], natural language generation (NLG) by fine-tuning GPT-2 Medium [39] on E2E-NLG [35], machine translation (MT) by training Transformer-Base [50] on WMT14 en-de [3], and LLaMA [49] fine-tuning (a data-loading sketch appears after the table).
Dataset Splits | No | The paper does not provide explicit numerical details (percentages or counts) for training/validation/test dataset splits. While it mentions taking results from the 'best epoch' (implying validation), the specific split used for validation is not provided.
Hardware Specification | Yes | The instruction tuning task uses two NVIDIA A100 80GB GPUs. A single RTX 3090 or 4090 GPU is used for runs of each GLUE task, and four RTX 3090 or 4090 GPUs for SQuAD and SQuAD 2.0.
Software Dependencies | No | The paper mentions using PyTorch, Huggingface, the LoRA codebase, a Transformer-Base codebase, the official Swin-T codebase, and the Alpaca codebase, but does not specify version numbers for any of these software components (a version-recording snippet appears after the table).
Experiment Setup | Yes | Hyperparameters for RoBERTa-L fine-tuning on GLUE are reported (the paper also provides hyperparameters for SQuAD and GPT-2 Medium); see the config sketch after the table.

Dataset | Batch Size | LR | Warmup | Max Train Steps | Max Seq. Len.
MNLI | 32 | 1e-5 | 7432 | 123873 | 128
QNLI | 32 | 1e-5 | 1986 | 33112 | 128
QQP | 32 | 1e-5 | 28318 | 113272 | 128
RTE | 16 | 2e-5 | 122 | 2036 | 512
MRPC | 16 | 1e-5 | 137 | 2296 | 512
SST-2 | 32 | 1e-5 | 1256 | 20935 | 512
CoLA | 16 | 1e-5 | 320 | 5336 | 512
STS-B | 16 | 2e-5 | 214 | 3598 | 512
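To make the pseudocode row concrete, below is a minimal sketch of the idea behind Algorithm 1 (compression-based memory efficient optimization): optimizer states are kept compressed between steps and only transiently dequantized for the update. The `quantize`/`dequantize` helpers, the per-tensor absmax scaling, and the `CompressedSGDM` class are simplified stand-ins of ours; they are not the paper's packed 4-bit storage or its block-wise/rank-1 normalization.

```python
import torch

def quantize(x, levels=7):
    """Toy symmetric absmax quantizer; stands in for the paper's 4-bit scheme."""
    scale = x.abs().max().clamp(min=1e-12)
    q = torch.round(x / scale * levels).to(torch.int8)  # codes in [-levels, levels]
    return q, scale

def dequantize(q, scale, levels=7):
    return q.to(torch.float32) * scale / levels

class CompressedSGDM:
    """SGD with momentum whose momentum state is kept quantized between steps."""

    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = list(params)
        self.lr, self.momentum = lr, momentum
        # State is stored compressed (quantized codes + scale), not in fp32.
        self.state = {p: quantize(torch.zeros_like(p)) for p in self.params}

    @torch.no_grad()
    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            q, scale = self.state[p]
            m = dequantize(q, scale)             # transiently restore full precision
            m.mul_(self.momentum).add_(p.grad)   # standard momentum update
            p.add_(m, alpha=-self.lr)            # apply parameter update
            self.state[p] = quantize(m)          # re-compress before storing
```

The paper's optimizers additionally pack two 4-bit codes per byte and normalize Adam's second moment with rank-1 normalization; the sketch only conveys the quantize-update-requantize loop.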
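For the open-datasets row, most of the benchmarks are publicly obtainable. A minimal loading sketch using the Hugging Face `datasets` library follows; the hub identifiers are our assumptions about convenient sources, not the authors' exact data pipeline, and ImageNet-1k plus the instruction-tuning data are omitted because they require separate access.

```python
from datasets import load_dataset

# Assumed Hugging Face hub identifiers for the reported benchmarks.
glue_mnli = load_dataset("glue", "mnli")    # NLU: GLUE tasks, e.g. MNLI
squad_v1 = load_dataset("squad")            # QA: SQuAD 1.1
squad_v2 = load_dataset("squad_v2")         # QA: SQuAD 2.0
e2e = load_dataset("e2e_nlg")               # NLG: E2E-NLG for GPT-2 Medium
wmt14 = load_dataset("wmt14", "de-en")      # MT: WMT14 en-de

print({split: len(ds) for split, ds in glue_mnli.items()})  # available splits and sizes
```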
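Since the software-dependencies row flags missing version numbers, anyone reproducing the experiments should record the versions they actually used. A minimal snippet, assuming PyTorch and the Hugging Face libraries are installed:

```python
# Record the library versions actually used, since the paper does not pin them.
import torch
import transformers
import datasets

for name, mod in [("torch", torch), ("transformers", transformers), ("datasets", datasets)]:
    print(f"{name}=={mod.__version__}")
```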
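Finally, the GLUE hyperparameters from the experiment-setup row can be collected into a single config. The dictionary layout and key names below are illustrative (ours); only the numbers come from the paper.

```python
# GLUE fine-tuning hyperparameters reported for RoBERTa-L.
GLUE_HPARAMS = {
    "MNLI":  dict(batch_size=32, lr=1e-5, warmup=7432,  max_steps=123873, max_seq_len=128),
    "QNLI":  dict(batch_size=32, lr=1e-5, warmup=1986,  max_steps=33112,  max_seq_len=128),
    "QQP":   dict(batch_size=32, lr=1e-5, warmup=28318, max_steps=113272, max_seq_len=128),
    "RTE":   dict(batch_size=16, lr=2e-5, warmup=122,   max_steps=2036,   max_seq_len=512),
    "MRPC":  dict(batch_size=16, lr=1e-5, warmup=137,   max_steps=2296,   max_seq_len=512),
    "SST-2": dict(batch_size=32, lr=1e-5, warmup=1256,  max_steps=20935,  max_seq_len=512),
    "CoLA":  dict(batch_size=16, lr=1e-5, warmup=320,   max_steps=5336,   max_seq_len=512),
    "STS-B": dict(batch_size=16, lr=2e-5, warmup=214,   max_steps=3598,   max_seq_len=512),
}

if __name__ == "__main__":
    for task, hp in GLUE_HPARAMS.items():
        print(f"{task:6s} {hp}")
```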