BitDelta: Your Fine-Tune May Only Be Worth One Bit
Authors: James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate BitDelta through experiments across the Llama-2, Mistral, and MPT model families, and on models up to 70B parameters, showcasing minimal performance degradation in all tested settings. |
| Researcher Affiliation | Collaboration | James Liu (MIT), Guangxuan Xiao (MIT), Kai Li (Princeton University), Jason D. Lee (Princeton University), Song Han (MIT, NVIDIA), Tri Dao (Princeton University, Together AI), Tianle Cai (Princeton University, Together AI) |
| Pseudocode | No | The paper describes the two stages of BitDelta, '1-bit quantization' and 'scale distillation,' with detailed steps and equations, but it does not present them in a structured pseudocode or algorithm block. (A sketch of the 1-bit quantization step appears after the table.) |
| Open Source Code | Yes | https://github.com/FasterDecoding/BitDelta |
| Open Datasets | Yes | We benchmark fine-tuned models based on the Llama-2 [53], Mistral [27], and MPT [51] model families: Vicuna, Xwin-LM, Solar-70B, Zephyr, OpenChat 3.5, Dolphin 2.2.1, and OpenOrca [10, 52, 56, 55, 57, 23, 37]. We evaluate on eight tasks: MT-Bench, 25-shot ARC Challenge, 5-shot BBH, 10-shot HellaSwag, zero-shot TruthfulQA, zero-shot LAMBADA, zero-shot Winogrande, and 5-shot GSM8K [66, 12, 50, 65, 34, 40, 48, 13]. |
| Dataset Splits | No | The paper mentions using well-known benchmark datasets for evaluation (e.g., MT-Bench, ARC Challenge, GSM8K) and describes few-shot or zero-shot evaluation settings, but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) for these datasets as used in their experiments. |
| Hardware Specification | Yes | 1x 80GB A100 GPU is used to distill 7B and 13B models, and 6x 80GB A100 GPUs are used to distill 70B models (2 for the fine-tuned model, 4 for the binarized model). |
| Software Dependencies | No | The paper mentions 'Adam optimizer [30]', 'FastChat [66]', and 'lm-evaluation-harness [20]' but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We use the Adam optimizer [30] with lr = 10⁻⁴, β = (0.9, 0.999), ϵ = 10⁻⁸. 1x 80GB A100 GPU is used to distill 7B and 13B models, and 6x 80GB A100 GPUs are used to distill 70B models (2 for the fine-tuned model, 4 for the binarized model). Scale distillation is fast; we can compress 70B models in roughly 10 minutes. (A toy sketch of this distillation stage follows the table.) |
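
Since the paper describes its '1-bit quantization' stage in prose and equations rather than pseudocode, the following minimal PyTorch sketch illustrates the idea: the fine-tuning delta of one weight matrix is replaced by its sign and a single per-matrix scale, initialized to the mean absolute value of the delta. Function names (`binarize_delta`, `reconstruct`) and the synthetic tensors are illustrative assumptions, not code from the authors' repository.

```python
import torch

def binarize_delta(w_base: torch.Tensor, w_fine: torch.Tensor):
    """Compress the fine-tuning delta of one weight matrix to 1 bit plus a scale.

    The delta is approximated as alpha * sign(delta), with alpha initialized to the
    mean absolute value of the delta; in practice the sign tensor is stored with
    1 bit per weight. This mirrors the paper's description of the first stage.
    """
    delta = w_fine - w_base
    sign = torch.sign(delta)      # values in {-1, 0, +1}
    alpha = delta.abs().mean()    # per-matrix scale, later refined by scale distillation
    return sign, alpha

def reconstruct(w_base: torch.Tensor, sign: torch.Tensor, alpha: torch.Tensor):
    """Approximate the fine-tuned weight from the base weight and the 1-bit delta."""
    return w_base + alpha * sign

if __name__ == "__main__":
    torch.manual_seed(0)
    w_base = torch.randn(512, 512)
    w_fine = w_base + 0.01 * torch.randn(512, 512)   # synthetic "fine-tuning" delta
    sign, alpha = binarize_delta(w_base, w_fine)
    w_hat = reconstruct(w_base, sign, alpha)
    rel_err = (w_hat - w_fine).norm() / (w_fine - w_base).norm()
    print("relative delta error:", rel_err.item())
```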
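
The experiment-setup row quotes the Adam hyperparameters used for scale distillation. The self-contained toy below sketches how that stage could work under simplifying assumptions: a single linear weight stands in for a full model, random inputs stand in for calibration data, and an MSE loss on outputs stands in for the paper's logit-matching objective; only the hyperparameters are taken from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for one weight matrix of a base and a fine-tuned model.
w_base = torch.randn(256, 256)
w_fine = w_base + 0.01 * torch.randn(256, 256)

# Stage 1: 1-bit quantization of the delta (sign + one trainable scale).
delta = w_fine - w_base
sign = torch.sign(delta)
alpha = torch.nn.Parameter(delta.abs().mean().clone())

# Stage 2: scale distillation -- train alpha so the compressed layer's outputs
# track the frozen fine-tuned layer's outputs on calibration inputs.
# Optimizer hyperparameters are the ones quoted in the experiment-setup row.
optimizer = torch.optim.Adam([alpha], lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
for step in range(200):
    x = torch.randn(32, 256)                   # stand-in for a calibration batch
    with torch.no_grad():
        target = x @ w_fine.T                  # frozen fine-tuned reference
    student = x @ (w_base + alpha * sign).T    # base weight + binarized delta
    loss = F.mse_loss(student, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("distilled scale:", alpha.item(), "vs. init:", delta.abs().mean().item())
```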