8-bit Optimizers via Block-wise Quantization
Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change. |
| Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change. and We open-source our custom CUDA kernels and provide a PyTorch implementation that enables 8-bit optimization by changing two lines of code. A sketch of this two-line swap is given after the table. |
| Open Datasets | Yes | We report on benchmarks in neural machine translation (Ott et al., 2018) trained on WMT'16 (Sennrich et al., 2016) and evaluated on en-de WMT'14 (Macháček and Bojar, 2014), large-scale language modeling (Lewis et al., 2021; Brown et al., 2020) and RoBERTa pretraining (Liu et al., 2019) on English CC-100 + RoBERTa corpus (Nagel, 2016; Gokaslan and Cohen, 2019; Zhu et al., 2015; Wenzek et al., 2020), finetuning the pretrained masked language model RoBERTa (Liu et al., 2019) on GLUE (Wang et al., 2018a), ResNet-50 v1.5 image classification (He et al., 2016) on ImageNet-1k (Deng et al., 2009), and MoCo v2 contrastive image pretraining and linear finetuning (Chen et al., 2020b) on ImageNet-1k (Deng et al., 2009). |
| Dataset Splits | Yes | We consistently report replication results for each benchmark with public codebases and report median accuracy, perplexity, or BLEU over ten random seeds for GLUE, three random seeds for other tasks, and a single random seed for large-scale language modeling. and To test optimization stability for small-scale language modeling, we run each setting with different hyperparameters and report median performance across all successful runs. A successful run is a run that does not crash due to exploding gradients or diverge in the loss. |
| Hardware Specification | Yes | Time is total GPU time on V100 GPUs, except for RoBERTa and GPT-3 pretraining, which were done on A100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch implementation' and 'CUDA kernels' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We do not change any hyperparameters or precision of weights, gradients, and activations/input gradients for each experimental setting compared to the public baseline; the only change is to replace 32-bit optimizers with 8-bit optimizers. and Key hyperparameters include 10 layers with a model dimension of 1024, a fully connected hidden dimension of 8192, 16 heads, and input sub-sequences with a length of 512 tokens each. and We use the hyperparameters ϵ ∈ {1e-8, 1e-7, 1e-6}, β1 ∈ {0.90, 0.87, 0.93}, β2 ∈ {0.999, 0.99, 0.98} and small changes in learning rates. A sketch of this sweep-and-report protocol is given after the table. |
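
The "two-line code change" quoted in the Open Source Code row refers to swapping a stock 32-bit PyTorch optimizer for the authors' 8-bit counterpart while leaving all hyperparameters untouched. Below is a minimal sketch of what that swap could look like, assuming the released implementation is the `bitsandbytes` package exposing a `bnb.optim.Adam8bit` class (names not stated in the quoted text) and that a CUDA device is available.

```python
# Minimal sketch of the advertised two-line optimizer swap. The `bitsandbytes`
# package name and `bnb.optim.Adam8bit` class are assumptions about the released
# code, not taken from the quotes above. Requires a CUDA device.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# 32-bit baseline (unchanged hyperparameters):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

# Line 1 of the change: the import above.
# Line 2 of the change: instantiate the 8-bit optimizer with the same hyperparameters.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

# Standard training step; weights, gradients, and activations keep their original
# precision, and only the optimizer states are stored in 8 bits via block-wise quantization.
loss = model(torch.randn(16, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```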
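
The stability protocol quoted in the Dataset Splits and Experiment Setup rows (sweep ϵ, β1, β2, then report the median over runs that neither crash nor diverge) can be summarized as the following sketch. `train_small_lm` is a hypothetical placeholder for one small-scale language-modeling run, not a function from the released code.

```python
# Sketch of the small-scale stability sweep: run every hyperparameter combination,
# drop runs that crash or diverge, and report the median over the successful ones.
# `train_small_lm` is a hypothetical placeholder, not part of the released code.
from itertools import product
from statistics import median

def train_small_lm(eps: float, beta1: float, beta2: float) -> float | None:
    """Placeholder for one training run; a real implementation would train the
    transformer LM described above and return validation perplexity, or None if
    the run crashed or the loss diverged."""
    return 20.0  # dummy value so the sketch runs end to end

grid = product([1e-8, 1e-7, 1e-6],      # eps
               [0.90, 0.87, 0.93],      # beta1
               [0.999, 0.99, 0.98])     # beta2

perplexities = []
for eps, beta1, beta2 in grid:
    ppl = train_small_lm(eps, beta1, beta2)
    if ppl is not None:                 # keep only successful runs
        perplexities.append(ppl)

print("median perplexity over successful runs:", median(perplexities))
```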