8-bit Optimizers via Block-wise Quantization

Authors: Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change."
Researcher Affiliation | Academia | "Anonymous authors. Paper under double-blind review."
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change." and "We open-source our custom CUDA kernels and provide a PyTorch implementation that enables 8-bit optimization by changing two lines of code." (a hedged sketch of this two-line change follows the table)
Open Datasets | Yes | "We report on benchmarks in neural machine translation (Ott et al., 2018) trained on WMT'16 (Sennrich et al., 2016) and evaluated on en-de WMT'14 (Macháček and Bojar, 2014), large-scale language modeling (Lewis et al., 2021; Brown et al., 2020) and RoBERTa pretraining (Liu et al., 2019) on English CC-100 + RoBERTa corpus (Nagel, 2016; Gokaslan and Cohen, 2019; Zhu et al., 2015; Wenzek et al., 2020), finetuning the pretrained masked language model RoBERTa (Liu et al., 2019) on GLUE (Wang et al., 2018a), ResNet-50 v1.5 image classification (He et al., 2016) on ImageNet-1k (Deng et al., 2009), and MoCo v2 contrastive image pretraining and linear finetuning (Chen et al., 2020b) on ImageNet-1k (Deng et al., 2009)."
Dataset Splits | Yes | "We consistently report replication results for each benchmark with public codebases and report median accuracy, perplexity, or BLEU over ten random seeds for GLUE, three random seeds for other tasks, and a single random seed for large-scale language modeling." and "To test optimization stability for small-scale language modeling, we run each setting with different hyperparameters and report median performance across all successful runs. A successful run is a run that does not crash due to exploding gradients or diverge in the loss."
Hardware Specification | Yes | "Time is total GPU time on V100 GPUs, except for RoBERTa and GPT-3 pretraining, which were done on A100 GPUs."
Software Dependencies | No | The paper mentions a PyTorch implementation and custom CUDA kernels but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | "We do not change any hyperparameters or precision of weights, gradients, and activations/input gradients for each experimental setting compared to the public baseline; the only change is to replace 32-bit optimizers with 8-bit optimizers." and "Key hyperparameters include 10 layers with a model dimension of 1024, a fully connected hidden dimension of 8192, 16 heads, and input sub-sequences with a length of 512 tokens each." and "We use the hyperparameters ε ∈ {1e-8, 1e-7, 1e-6}, β1 ∈ {0.90, 0.87, 0.93}, β2 ∈ {0.999, 0.99, 0.98} and small changes in learning rates."
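
The "Open Source Code" row quotes a two-line code change that enables the 8-bit optimizers in PyTorch. A minimal sketch of what such a change could look like is given below; the package and class names (bitsandbytes, bnb.optim.Adam8bit) are assumptions about the released implementation, since the excerpts above do not name them.

```python
# Minimal sketch of the described "two-line code change", assuming the released
# PyTorch implementation is the bitsandbytes package exposing an 8-bit Adam as
# bnb.optim.Adam8bit (names are assumptions, not quoted from the paper).
import torch
import torch.nn as nn
import bitsandbytes as bnb  # changed line 1: import the 8-bit optimizer package

model = nn.Linear(1024, 1024).cuda()  # any existing PyTorch model

# 32-bit baseline:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.995))
# changed line 2: drop-in 8-bit replacement with the same hyperparameters
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.995))

# The training loop itself is unchanged.
x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```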
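The "Experiment Setup" row also describes the stability protocol for small-scale language modeling: sweep a small grid of (ε, β1, β2) values plus small learning-rate changes, keep only runs that neither crash nor diverge, and report the median over the successful runs. The sketch below illustrates that protocol under stated assumptions; train_lm is a hypothetical placeholder for one training run, and the learning-rate factors are assumptions.

```python
# Illustrative sketch of the stability sweep quoted above (not the paper's code).
import itertools
import math
import statistics

def train_lm(lr: float, betas: tuple, eps: float) -> float:
    """Hypothetical stand-in for one small-scale LM training run.
    A real implementation would train the 10-layer transformer described above
    and return the final validation perplexity, or float('inf') if the loss
    diverged or gradients exploded."""
    return 20.0  # dummy value so the sketch runs end to end

eps_grid = [1e-8, 1e-7, 1e-6]
beta1_grid = [0.90, 0.87, 0.93]
beta2_grid = [0.999, 0.99, 0.98]
lr_grid = [1e-3 * f for f in (0.8, 1.0, 1.25)]  # "small changes in learning rates"; factors are assumptions

successful = []
for eps, b1, b2, lr in itertools.product(eps_grid, beta1_grid, beta2_grid, lr_grid):
    ppl = train_lm(lr=lr, betas=(b1, b2), eps=eps)
    if math.isfinite(ppl):  # a "successful run" neither crashes nor diverges
        successful.append(ppl)

print("median perplexity over successful runs:", statistics.median(successful))
```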