Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees

Authors: Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher Ré, Ce Zhang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated AQ-SGD to fine-tune language models with up to 1.5 billion parameters, compressing activations to 2-4 bits. AQ-SGD provides up to 4.3× end-to-end speed-up in slower networks, without sacrificing model quality. and We conduct extensive experiments on sequence classification and language modeling datasets using DeBERTa-1.5B and GPT2-1.5B models, respectively.
Researcher Affiliation | Collaboration | Jue Wang1,3, Binhang Yuan1, Luka Rimanic1, Yongjun He1, Tri Dao2, Beidi Chen4,5, Christopher Ré2, Ce Zhang1; 1ETH Zürich, Switzerland; 2Stanford University, USA; 3Zhejiang University, China; 4Carnegie Mellon University; 5Meta AI; {juewang, binhang.yuan, luka.rimanic, yongjun.he, ce.zhang}@inf.ethz.ch; {beidic, trid, chrismre}@stanford.edu. Equal contribution. Now at Google.
Pseudocode | Yes | Algorithm 1: AQ-SGD Algorithm (an illustrative sketch of the communicated-change idea appears after the table).
Open Source Code | Yes | Our code is available at: https://github.com/DS3Lab/AC-SGD.
Open Datasets | Yes | All datasets are publicly available and do not contain sensitive or offensive content. Detailed setup can be found in the appendix. For sequence classification, we fine-tune a 1.5B-parameter DeBERTa on two datasets: QNLI and CoLA. For language modeling, we fine-tune the GPT2 model with 1.5B parameters on two datasets: WikiText2 and arXiv abstracts. (A dataset-loading sketch appears after the table.)
Dataset Splits | No | The paper mentions using the QNLI, CoLA, WikiText2, and arXiv abstracts datasets but does not explicitly provide training, validation, or test split details in the main body. It defers to an appendix for the detailed setup, which is not provided in this context.
Hardware Specification | Yes | We conduct our experiments on AWS with 8-32 p3.2xlarge instances, each containing a V100 GPU.
Software Dependencies | No | The paper mentions that the system was built on 'PyTorch' and 'CuPy' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We perform grid search to choose learning rate from {2.5e-6, 3e-6, 5e-6, 1e-5} and macro-batch size from {32, 64, 96} for best model performance. We train all models using the Adam optimizer with weight decay. (A hyperparameter-sweep sketch appears after the table.)
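
Illustrative sketch of the AQ-SGD communication step (referenced from the Pseudocode row). This is a minimal rendering of the mechanism the paper describes, not the authors' implementation (see the AC-SGD repository linked above): between pipeline stages, each worker keeps a per-sample buffer of what was last communicated and sends only the quantized change in the activation. The names quantize, AQSGDSender, and AQSGDReceiver, and the quantizer itself, are illustrative assumptions.

```python
import torch

def quantize(t: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Placeholder for the paper's quantization function Q: deterministic
    uniform rounding onto 2**bits levels over the tensor's own range,
    purely for illustration."""
    levels = 2 ** bits - 1
    scale = t.abs().max().clamp(min=1e-8)
    q = torch.round((t / scale + 1.0) / 2.0 * levels)   # map to integer levels
    return (q / levels * 2.0 - 1.0) * scale             # dequantized value

class AQSGDSender:
    """Pipeline stage that sends activations.

    For every training sample (keyed by its index), keep the previously
    communicated value and transmit only the quantized change; the local
    buffer is updated with the same quantized change so sender and
    receiver stay in sync."""
    def __init__(self, num_samples: int, act_shape, bits: int = 4):
        self.buffer = torch.zeros(num_samples, *act_shape)
        self.bits = bits

    def message(self, idx: int, activation: torch.Tensor) -> torch.Tensor:
        delta_q = quantize(activation - self.buffer[idx], self.bits)
        self.buffer[idx] += delta_q      # mirror what the receiver will hold
        return delta_q                   # only this low-bit delta crosses the network

class AQSGDReceiver:
    """Next pipeline stage: reconstructs the activation from its own buffer."""
    def __init__(self, num_samples: int, act_shape):
        self.buffer = torch.zeros(num_samples, *act_shape)

    def reconstruct(self, idx: int, delta_q: torch.Tensor) -> torch.Tensor:
        self.buffer[idx] += delta_q
        return self.buffer[idx]
```

On the first pass the buffers are zero, so the scheme reduces to quantizing activations directly; as fine-tuning converges the per-sample activation changes shrink, which is the property the paper's convergence guarantees build on.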
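
Dataset access sketch (referenced from the Open Datasets row). The paper does not state how the data were loaded; one public route, assumed here for illustration, is the Hugging Face datasets library. The arXiv-abstracts corpus has no standard Hub config and is omitted.

```python
from datasets import load_dataset

# Sequence classification (GLUE tasks used for DeBERTa-1.5B fine-tuning).
qnli = load_dataset("glue", "qnli")
cola = load_dataset("glue", "cola")

# Language modeling (used for GPT2-1.5B fine-tuning).
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")

# Each DatasetDict exposes the standard train/validation/test splits, which is
# where split sizes can be checked even though the paper's main body does not
# report them.
print({name: len(split) for name, split in qnli.items()})
```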
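
Experiment-setup sketch (referenced from the Experiment Setup row). The grid below is quoted from the paper; the loop structure, the reading of "Adam with weight decay" as torch.optim.AdamW, and the 0.01 weight-decay coefficient are assumptions for illustration.

```python
import itertools
import torch

# Hyperparameter grid quoted in the paper.
learning_rates = [2.5e-6, 3e-6, 5e-6, 1e-5]
macro_batch_sizes = [32, 64, 96]

model = torch.nn.Linear(8, 2)  # stand-in for the 1.5B-parameter model

for lr, macro_batch in itertools.product(learning_rates, macro_batch_sizes):
    # "Adam with weight decay" is taken here to mean torch.optim.AdamW;
    # 0.01 is PyTorch's default weight decay, not a value from the paper.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    # ... fine-tune with macro-batches of `macro_batch` samples, evaluate,
    #     and keep the configuration with the best validation metric ...
```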