8-bit Optimizers via Block-wise Quantization

Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change."
Researcher Affiliation | Academia | "Anonymous authors. Paper under double-blind review."
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change." and "We open-source our custom CUDA kernels and provide a PyTorch implementation that enables 8-bit optimization by changing two lines of code." (a hedged sketch of this two-line change follows the table)
Open Datasets | Yes | "We report on benchmarks in neural machine translation (Ott et al., 2018) trained on WMT'16 (Sennrich et al., 2016) and evaluated on en-de WMT'14 (Macháček and Bojar, 2014), large-scale language modeling (Lewis et al., 2021; Brown et al., 2020) and RoBERTa pretraining (Liu et al., 2019) on English CC-100 + RoBERTa corpus (Nagel, 2016; Gokaslan and Cohen, 2019; Zhu et al., 2015; Wenzek et al., 2020), finetuning the pretrained masked language model RoBERTa (Liu et al., 2019) on GLUE (Wang et al., 2018a), ResNet-50 v1.5 image classification (He et al., 2016) on ImageNet-1k (Deng et al., 2009), and MoCo v2 contrastive image pretraining and linear finetuning (Chen et al., 2020b) on ImageNet-1k (Deng et al., 2009)."
Dataset Splits | Yes | "We consistently report replication results for each benchmark with public codebases and report median accuracy, perplexity, or BLEU over ten random seeds for GLUE, three random seeds for other tasks, and a single random seed for large-scale language modeling." and "To test optimization stability for small-scale language modeling, we run each setting with different hyperparameters and report median performance across all successful runs. A successful run is a run that does not crash due to exploding gradients or diverge in the loss."
Hardware Specification | Yes | "Time is total GPU time on V100 GPUs, except for RoBERTa and GPT-3 pretraining, which were done on A100 GPUs."
Software Dependencies | No | The paper mentions a PyTorch implementation and custom CUDA kernels but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We do not change any hyperparameters or precision of weights, gradients, and activations/input gradients for each experimental setting compared to the public baseline; the only change is to replace 32-bit optimizers with 8-bit optimizers." and "Key hyperparameters include 10 layers with a model dimension of 1024, a fully connected hidden dimension of 8192, 16 heads, and input sub-sequences with a length of 512 tokens each." and "We use the hyperparameters ε ∈ {1e-8, 1e-7, 1e-6}, β1 ∈ {0.90, 0.87, 0.93}, β2 ∈ {0.999, 0.99, 0.98} and small changes in learning rates."
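
The "Open Source Code" row quotes a two-line code change that enables the 8-bit optimizers in PyTorch. A minimal sketch of what such a change could look like is given below; the package and class names (bitsandbytes, bnb.optim.Adam8bit) are assumptions about the released implementation, since the excerpts above do not name them.

```python
# Minimal sketch of the described "two-line code change", assuming the released
# PyTorch implementation is the bitsandbytes package exposing an 8-bit Adam as
# bnb.optim.Adam8bit (names are assumptions, not quoted from the paper).
import torch
import torch.nn as nn
import bitsandbytes as bnb  # changed line 1: import the 8-bit optimizer package

model = nn.Linear(1024, 1024).cuda()  # any existing PyTorch model

# 32-bit baseline:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.995))
# changed line 2: drop-in 8-bit replacement with the same hyperparameters
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.995))

# The training loop itself is unchanged.
x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```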
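The "Experiment Setup" row also describes the stability protocol for small-scale language modeling: sweep a small grid of (ε, β1, β2) values plus small learning-rate changes, keep only runs that neither crash nor diverge, and report the median over the successful runs. The sketch below illustrates that protocol under stated assumptions; train_lm is a hypothetical placeholder for one training run, and the learning-rate factors are assumptions.

```python
# Illustrative sketch of the stability sweep quoted above (not the paper's code).
import itertools
import math
import statistics

def train_lm(lr: float, betas: tuple, eps: float) -> float:
    """Hypothetical stand-in for one small-scale LM training run.
    A real implementation would train the 10-layer transformer described above
    and return the final validation perplexity, or float('inf') if the loss
    diverged or gradients exploded."""
    return 20.0  # dummy value so the sketch runs end to end

eps_grid = [1e-8, 1e-7, 1e-6]
beta1_grid = [0.90, 0.87, 0.93]
beta2_grid = [0.999, 0.99, 0.98]
lr_grid = [1e-3 * f for f in (0.8, 1.0, 1.25)]  # "small changes in learning rates"; factors are assumptions

successful = []
for eps, b1, b2, lr in itertools.product(eps_grid, beta1_grid, beta2_grid, lr_grid):
    ppl = train_lm(lr=lr, betas=(b1, b2), eps=eps)
    if math.isfinite(ppl):  # a "successful run" neither crashes nor diverges
        successful.append(ppl)

print("median perplexity over successful runs:", statistics.median(successful))
```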