Memory Efficient Optimizers with 4-bit States

Authors: Bingrui Li, Jianfei Chen, Jun Zhu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all the tasks our optimizers can achieve comparable accuracy with their full-precision counterparts, while enjoying better memory efficiency.
Researcher Affiliation | Collaboration | Bingrui Li, Jianfei Chen, Jun Zhu, all with the Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University.
Pseudocode | Yes | Algorithm 1: Compression-based Memory Efficient Optimization Framework; Algorithm 2: Compression-based Memory Efficient SGDM; Algorithm 3: Compression-based Memory Efficient Adam; Algorithm 4: Rank-1 Normalization (an illustrative sketch of the framework appears after the table).
Open Source Code | Yes | Code is available at https://github.com/thu-ml/low-bit-optimizers
Open Datasets | Yes | We report performance metrics on standard benchmarks, including image classification (CLS) with Swin-T [31] on ImageNet-1k [12], natural language understanding (NLU) by fine-tuning RoBERTa-L [30] on GLUE [51], question answering (QA) by fine-tuning RoBERTa-L on SQuAD [42, 43], natural language generation (NLG) by fine-tuning GPT-2 Medium [39] on E2E-NLG [35], machine translation (MT) by training Transformer-Base [50] on WMT14 en-de [3], and LLaMA [49] fine-tuning (a data-loading sketch appears after the table).
Dataset Splits | No | The paper does not provide explicit numerical details (percentages or counts) for training/validation/test dataset splits. While it mentions taking results from the 'best epoch' (implying validation), the specific split used for validation is not provided.
Hardware Specification | Yes | The instruction tuning task uses two NVIDIA A100 80GB GPUs. A single RTX 3090 or 4090 GPU is used for runs of each GLUE task, and four RTX 3090 or 4090 GPUs for SQuAD and SQuAD 2.0.
Software Dependencies | No | The paper mentions using PyTorch, Huggingface, the LoRA codebase, a Transformer-Base codebase, the official Swin-T codebase, and the Alpaca codebase, but does not specify version numbers for any of these software components (a version-recording snippet appears after the table).
Experiment Setup | Yes | Hyperparameters for RoBERTa-L fine-tuning on GLUE are reported (the paper also provides hyperparameters for SQuAD and GPT-2 Medium); see the config sketch after the table.

Dataset | Batch Size | LR | Warmup | Max Train Steps | Max Seq. Len.
MNLI | 32 | 1e-5 | 7432 | 123873 | 128
QNLI | 32 | 1e-5 | 1986 | 33112 | 128
QQP | 32 | 1e-5 | 28318 | 113272 | 128
RTE | 16 | 2e-5 | 122 | 2036 | 512
MRPC | 16 | 1e-5 | 137 | 2296 | 512
SST-2 | 32 | 1e-5 | 1256 | 20935 | 512
CoLA | 16 | 1e-5 | 320 | 5336 | 512
STS-B | 16 | 2e-5 | 214 | 3598 | 512
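To make the pseudocode row concrete, below is a minimal sketch of the idea behind Algorithm 1 (compression-based memory efficient optimization): optimizer states are kept compressed between steps and only transiently dequantized for the update. The `quantize`/`dequantize` helpers, the per-tensor absmax scaling, and the `CompressedSGDM` class are simplified stand-ins of ours; they are not the paper's packed 4-bit storage or its block-wise/rank-1 normalization.

```python
import torch

def quantize(x, levels=7):
    """Toy symmetric absmax quantizer; stands in for the paper's 4-bit scheme."""
    scale = x.abs().max().clamp(min=1e-12)
    q = torch.round(x / scale * levels).to(torch.int8)  # codes in [-levels, levels]
    return q, scale

def dequantize(q, scale, levels=7):
    return q.to(torch.float32) * scale / levels

class CompressedSGDM:
    """SGD with momentum whose momentum state is kept quantized between steps."""

    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = list(params)
        self.lr, self.momentum = lr, momentum
        # State is stored compressed (quantized codes + scale), not in fp32.
        self.state = {p: quantize(torch.zeros_like(p)) for p in self.params}

    @torch.no_grad()
    def step(self):
        for p in self.params:
            if p.grad is None:
                continue
            q, scale = self.state[p]
            m = dequantize(q, scale)             # transiently restore full precision
            m.mul_(self.momentum).add_(p.grad)   # standard momentum update
            p.add_(m, alpha=-self.lr)            # apply parameter update
            self.state[p] = quantize(m)          # re-compress before storing
```

The paper's optimizers additionally pack two 4-bit codes per byte and normalize Adam's second moment with rank-1 normalization; the sketch only conveys the quantize-update-requantize loop.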
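For the open-datasets row, most of the benchmarks are publicly obtainable. A minimal loading sketch using the Hugging Face `datasets` library follows; the hub identifiers are our assumptions about convenient sources, not the authors' exact data pipeline, and ImageNet-1k plus the instruction-tuning data are omitted because they require separate access.

```python
from datasets import load_dataset

# Assumed Hugging Face hub identifiers for the reported benchmarks.
glue_mnli = load_dataset("glue", "mnli")    # NLU: GLUE tasks, e.g. MNLI
squad_v1 = load_dataset("squad")            # QA: SQuAD 1.1
squad_v2 = load_dataset("squad_v2")         # QA: SQuAD 2.0
e2e = load_dataset("e2e_nlg")               # NLG: E2E-NLG for GPT-2 Medium
wmt14 = load_dataset("wmt14", "de-en")      # MT: WMT14 en-de

print({split: len(ds) for split, ds in glue_mnli.items()})  # available splits and sizes
```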
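Since the software-dependencies row flags missing version numbers, anyone reproducing the experiments should record the versions they actually used. A minimal snippet, assuming PyTorch and the Hugging Face libraries are installed:

```python
# Record the library versions actually used, since the paper does not pin them.
import torch
import transformers
import datasets

for name, mod in [("torch", torch), ("transformers", transformers), ("datasets", datasets)]:
    print(f"{name}=={mod.__version__}")
```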
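Finally, the GLUE hyperparameters from the experiment-setup row can be collected into a single config. The dictionary layout and key names below are illustrative (ours); only the numbers come from the paper.

```python
# GLUE fine-tuning hyperparameters reported for RoBERTa-L.
GLUE_HPARAMS = {
    "MNLI":  dict(batch_size=32, lr=1e-5, warmup=7432,  max_steps=123873, max_seq_len=128),
    "QNLI":  dict(batch_size=32, lr=1e-5, warmup=1986,  max_steps=33112,  max_seq_len=128),
    "QQP":   dict(batch_size=32, lr=1e-5, warmup=28318, max_steps=113272, max_seq_len=128),
    "RTE":   dict(batch_size=16, lr=2e-5, warmup=122,   max_steps=2036,   max_seq_len=512),
    "MRPC":  dict(batch_size=16, lr=1e-5, warmup=137,   max_steps=2296,   max_seq_len=512),
    "SST-2": dict(batch_size=32, lr=1e-5, warmup=1256,  max_steps=20935,  max_seq_len=512),
    "CoLA":  dict(batch_size=16, lr=1e-5, warmup=320,   max_steps=5336,   max_seq_len=512),
    "STS-B": dict(batch_size=16, lr=2e-5, warmup=214,   max_steps=3598,   max_seq_len=512),
}

if __name__ == "__main__":
    for task, hp in GLUE_HPARAMS.items():
        print(f"{task:6s} {hp}")
```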