Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
8-bit Optimizers via Block-wise Quantization
Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer
ICLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT 14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change. |
| Researcher Affiliation | Academia | Anonymous authors Paper under double-blind review |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change. and We open-source our custom CUDA kernels and provide a PyTorch implementation that enables 8-bit optimization by changing two lines of code. |
| Open Datasets | Yes | We report on benchmarks in neural machine translation (Ott et al., 2018) trained on WMT 16 (Sennrich et al., 2016) and evaluated on en-de WMT 14 (Macháček and Bojar, 2014), large-scale language modeling (Lewis et al., 2021; Brown et al., 2020) and RoBERTa pretraining (Liu et al., 2019) on English CC-100 + RoBERTa corpus (Nagel, 2016; Gokaslan and Cohen, 2019; Zhu et al., 2015; Wenzek et al., 2020), finetuning the pretrained masked language model RoBERTa (Liu et al., 2019) on GLUE (Wang et al., 2018a), ResNet-50 v1.5 image classification (He et al., 2016) on ImageNet-1k (Deng et al., 2009), and MoCo v2 contrastive image pretraining and linear finetuning (Chen et al., 2020b) on ImageNet-1k (Deng et al., 2009). |
| Dataset Splits | Yes | We consistently report replication results for each benchmark with public codebases and report median accuracy, perplexity, or BLEU over ten random seeds for GLUE, three random seeds for other tasks, and a single random seed for large scale language modeling. and To test optimization stability for small-scale language modeling, we run each setting with different hyperparameters and report median performance across all successful runs. A successful run is a run that does not crash due to exploding gradients or diverge in the loss. |
| Hardware Specification | Yes | Time is total GPU time on V100 GPUs, except for RoBERTa and GPT3 pretraining, which were done on A100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch implementation' and 'CUDA kernels' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We do not change any hyperparameters or precision of weights, gradients, and activations/input gradients for each experimental setting compared to the public baseline; the only change is to replace 32-bit optimizers with 8-bit optimizers. and Key hyperparameters include 10 layers with a model dimension of 1024, a fully connected hidden dimension of 8192, 16 heads, and input sub-sequences with a length of 512 tokens each. and We use the hyperparameters ϵ ∈ {1e-8, 1e-7, 1e-6}, β1 ∈ {0.90, 0.87, 0.93}, β2 ∈ {0.999, 0.99, 0.98} and small changes in learning rates. |
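
The stability protocol quoted in the Dataset Splits and Experiment Setup rows (a sweep over ϵ, β1, and β2, reporting the median over successful runs) can be illustrated with a minimal Python sketch. The three value sets are taken from the table. Everything else is an assumption made for illustration: the function names are hypothetical, the sketch assumes the sweep is a full cross product of the three sets (the quoted text does not say so), it omits the unspecified learning-rate changes, and it marks a crashed or diverged run with `None`, NaN, or infinity.

```python
import itertools
import math
import statistics

# Value sets quoted in the Experiment Setup row.
EPSILONS = (1e-8, 1e-7, 1e-6)
BETA1S = (0.90, 0.87, 0.93)
BETA2S = (0.999, 0.99, 0.98)


def hyperparameter_grid():
    """Return every (eps, beta1, beta2) combination as a dict.

    Assumption: a full cross product; the source only lists the value sets.
    """
    return [
        {"eps": eps, "beta1": b1, "beta2": b2}
        for eps, b1, b2 in itertools.product(EPSILONS, BETA1S, BETA2S)
    ]


def median_over_successful_runs(final_metrics):
    """Median of the metrics from runs that neither crashed nor diverged.

    Assumption: a failed run is recorded as None, NaN, or +/-inf.
    Returns None when no run succeeded.
    """
    successful = [
        m for m in final_metrics if m is not None and math.isfinite(m)
    ]
    if not successful:
        return None
    return statistics.median(successful)
```

Under these assumptions the grid holds 27 settings, and a caller would pass one final metric per run (for example, validation perplexity) to `median_over_successful_runs`.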