Memory Efficient Optimizers with 4-bit States
Authors: Bingrui Li, Jianfei Chen, Jun Zhu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all the tasks our optimizers can achieve comparable accuracy with their full-precision counterparts, while enjoying better memory efficiency. |
| Researcher Affiliation | Collaboration | Bingrui Li, Jianfei Chen, Jun Zhu — Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University |
| Pseudocode | Yes | Algorithm 1: Compression-based Memory Efficient Optimization Framework; Algorithm 2: Compression-based Memory Efficient SGDM; Algorithm 3: Compression-based Memory Efficient Adam; Algorithm 4: Rank-1 Normalization (a hedged sketch of the framework appears below the table). |
| Open Source Code | Yes | Code is available at https://github.com/thu-ml/low-bit-optimizers |
| Open Datasets | Yes | We report performance metrics on standard benchmarks, including image classification (CLS) with Swin-T [31] on ImageNet-1k [12], natural language understanding (NLU) by fine-tuning RoBERTa-L [30] on GLUE [51], question answering (QA) by fine-tuning RoBERTa-L on SQuAD [42, 43], natural language generation (NLG) by fine-tuning GPT-2 Medium [39] on E2E-NLG [35], machine translation (MT) by training Transformer-Base [50] on WMT14 en-de [3], and LLaMA [49] fine-tuning. |
| Dataset Splits | No | The paper does not provide explicit numerical details (percentages or counts) for training/validation/test dataset splits. While it mentions taking results from the 'best epoch' (implying validation), the specific split used for validation is not provided. |
| Hardware Specification | Yes | The instruction tuning task uses two Nvidia A100 80GB GPUs. We utilize a single RTX 3090 or 4090 GPU for runs of each task in the GLUE datasets and four RTX 3090 or 4090 GPUs for SQuAD and SQuAD 2.0. |
| Software Dependencies | No | The paper mentions using PyTorch, Huggingface, the LoRA codebase, the official Swin-T codebase, a Transformer-Base codebase, and the Alpaca codebase, but does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | Hyperparameters for RoBERTa-L fine-tuning on GLUE, per dataset (MNLI / QNLI / QQP / RTE / MRPC / SST-2 / CoLA / STS-B): Batch Size 32 / 32 / 32 / 16 / 16 / 32 / 16 / 16; LR 1e-5 / 1e-5 / 1e-5 / 2e-5 / 1e-5 / 1e-5 / 1e-5 / 2e-5; Warmup 7432 / 1986 / 28318 / 122 / 137 / 1256 / 320 / 214; Max Train Steps 123873 / 33112 / 113272 / 2036 / 2296 / 20935 / 5336 / 3598; Max Seq. Len. 128 / 128 / 128 / 512 / 512 / 512 / 512 / 512. The paper also provides hyperparameters for SQuAD and GPT-2 Medium (see the configuration sketch below the table). |
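
The Pseudocode row lists a compression-based framework (Algorithm 1) in which optimizer states are kept in 4-bit form and decompressed only while the update is computed. Below is a minimal, hedged sketch of that idea for SGD with momentum. It is not the authors' implementation: it uses a naive per-tensor absmax quantizer in place of the paper's rank-1 normalization and dynamic-exponent quantization map, and it stores codes as int8 rather than packing two 4-bit codes per byte.

```python
# Minimal sketch (not the authors' code) of a compression-based memory-efficient
# optimizer step in the spirit of Algorithm 1: the momentum state lives in a
# low-bit compressed form and is decompressed only transiently for the update.
import torch

def quantize_4bit(x: torch.Tensor):
    """Map a float tensor to integer codes in [-7, 7] plus a per-tensor scale."""
    scale = x.abs().max().clamp(min=1e-12)
    codes = torch.round(x / scale * 7).to(torch.int8)  # real kernels pack two 4-bit codes per byte
    return codes, scale

def dequantize_4bit(codes: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate float tensor from codes and scale."""
    return codes.float() / 7 * scale

class CompressedSGDM:
    """SGD with momentum whose momentum buffer is stored in compressed form."""
    def __init__(self, params, lr=0.1, momentum=0.9):
        self.params = list(params)
        self.lr, self.momentum = lr, momentum
        # Compressed momentum state: (codes, scale) per parameter tensor.
        self.state = [quantize_4bit(torch.zeros_like(p)) for p in self.params]

    @torch.no_grad()
    def step(self):
        for i, p in enumerate(self.params):
            if p.grad is None:
                continue
            m = dequantize_4bit(*self.state[i])   # decompress state
            m = self.momentum * m + p.grad        # momentum update
            p.add_(m, alpha=-self.lr)             # parameter update
            self.state[i] = quantize_4bit(m)      # re-compress state
```

Used as `opt = CompressedSGDM(model.parameters(), lr=0.1)` in place of `torch.optim.SGD`, the momentum buffer nominally shrinks from 32 bits to 4 bits per element, which is the kind of memory saving the paper targets; the paper's Algorithms 2-4 refine the quantizer so that accuracy is preserved.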
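
For the Experiment Setup row, the quoted MNLI column (batch size 32, LR 1e-5, 7432 warmup steps, 123873 max training steps, max sequence length 128) maps onto a standard warmup-plus-linear-decay fine-tuning configuration. The sketch below is an assumed wiring using HuggingFace's `get_linear_schedule_with_warmup`; the paper's actual runs swap in its 4-bit optimizer and use its own codebases, so a full-precision AdamW stands in here.

```python
# Hypothetical wiring of the quoted MNLI hyperparameters into a standard
# RoBERTa-L fine-tuning setup (the paper replaces the optimizer with its
# 4-bit variant; full-precision AdamW is used here as a stand-in).
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)

mnli_cfg = dict(batch_size=32, lr=1e-5, warmup_steps=7432,
                max_train_steps=123873, max_seq_len=128)

optimizer = torch.optim.AdamW(model.parameters(), lr=mnli_cfg["lr"])
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=mnli_cfg["warmup_steps"],
    num_training_steps=mnli_cfg["max_train_steps"],
)
# Per training step: forward/backward on a batch of 32 sequences truncated to
# 128 tokens, then optimizer.step() followed by scheduler.step().
```

The other GLUE datasets follow the same pattern with the per-dataset values quoted in the table row above.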