BitDelta: Your Fine-Tune May Only Be Worth One Bit
Authors: James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate BitDelta through experiments across the Llama-2, Mistral, and MPT model families, and on models up to 70B parameters, showcasing minimal performance degradation in all tested settings. |
| Researcher Affiliation | Collaboration | James Liu (MIT), Guangxuan Xiao (MIT), Kai Li (Princeton University), Jason D. Lee (Princeton University), Song Han (MIT, NVIDIA), Tri Dao (Princeton University, Together AI), Tianle Cai (Princeton University, Together AI) |
| Pseudocode | No | The paper describes the two stages of BitDelta, '1-bit quantization' and 'scale distillation,' with detailed steps and equations, but it does not present them in a structured pseudocode or algorithm block. (A sketch of the 1-bit quantization step appears after the table.) |
| Open Source Code | Yes | https://github.com/FasterDecoding/BitDelta |
| Open Datasets | Yes | We benchmark fine-tuned models based on the Llama-2 [53], Mistral [27], and MPT [51] model families: Vicuna, Xwin-LM, Solar-70B, Zephyr, OpenChat 3.5, Dolphin 2.2.1, and OpenOrca [10, 52, 56, 55, 57, 23, 37]. We evaluate on eight tasks: MT-Bench, 25-shot ARC Challenge, 5-shot BBH, 10-shot HellaSwag, zero-shot TruthfulQA, zero-shot LAMBADA, zero-shot Winogrande, and 5-shot GSM8K [66, 12, 50, 65, 34, 40, 48, 13]. |
| Dataset Splits | No | The paper mentions using well-known benchmark datasets for evaluation (e.g., MT-Bench, ARC Challenge, GSM8K) and describes few-shot or zero-shot evaluation settings, but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) for these datasets as used in their experiments. |
| Hardware Specification | Yes | 1x 80GB A100 GPU is used to distill 7B and 13B models, and 6x 80GB A100 GPUs are used to distill 70B models (2 for the fine-tuned model, 4 for the binarized model). |
| Software Dependencies | No | The paper mentions 'Adam optimizer [30]', 'FastChat [66]', and 'lm-evaluation-harness [20]' but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We use the Adam optimizer [30] with lr = 10⁻⁴, β = (0.9, 0.999), ϵ = 10⁻⁸. 1x 80GB A100 GPU is used to distill 7B and 13B models, and 6x 80GB A100 GPUs are used to distill 70B models (2 for the fine-tuned model, 4 for the binarized model). Scale distillation is fast; we can compress 70B models in roughly 10 minutes. (A toy sketch of this distillation stage follows the table.) |
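
Since the paper describes its '1-bit quantization' stage in prose and equations rather than pseudocode, the following minimal PyTorch sketch illustrates the idea: the fine-tuning delta of one weight matrix is replaced by its sign and a single per-matrix scale, initialized to the mean absolute value of the delta. Function names (`binarize_delta`, `reconstruct`) and the synthetic tensors are illustrative assumptions, not code from the authors' repository.

```python
import torch

def binarize_delta(w_base: torch.Tensor, w_fine: torch.Tensor):
    """Compress the fine-tuning delta of one weight matrix to 1 bit plus a scale.

    The delta is approximated as alpha * sign(delta), with alpha initialized to the
    mean absolute value of the delta; in practice the sign tensor is stored with
    1 bit per weight. This mirrors the paper's description of the first stage.
    """
    delta = w_fine - w_base
    sign = torch.sign(delta)      # values in {-1, 0, +1}
    alpha = delta.abs().mean()    # per-matrix scale, later refined by scale distillation
    return sign, alpha

def reconstruct(w_base: torch.Tensor, sign: torch.Tensor, alpha: torch.Tensor):
    """Approximate the fine-tuned weight from the base weight and the 1-bit delta."""
    return w_base + alpha * sign

if __name__ == "__main__":
    torch.manual_seed(0)
    w_base = torch.randn(512, 512)
    w_fine = w_base + 0.01 * torch.randn(512, 512)   # synthetic "fine-tuning" delta
    sign, alpha = binarize_delta(w_base, w_fine)
    w_hat = reconstruct(w_base, sign, alpha)
    rel_err = (w_hat - w_fine).norm() / (w_fine - w_base).norm()
    print("relative delta error:", rel_err.item())
```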
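
The experiment-setup row quotes the Adam hyperparameters used for scale distillation. The self-contained toy below sketches how that stage could work under simplifying assumptions: a single linear weight stands in for a full model, random inputs stand in for calibration data, and an MSE loss on outputs stands in for the paper's logit-matching objective; only the hyperparameters are taken from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for one weight matrix of a base and a fine-tuned model.
w_base = torch.randn(256, 256)
w_fine = w_base + 0.01 * torch.randn(256, 256)

# Stage 1: 1-bit quantization of the delta (sign + one trainable scale).
delta = w_fine - w_base
sign = torch.sign(delta)
alpha = torch.nn.Parameter(delta.abs().mean().clone())

# Stage 2: scale distillation -- train alpha so the compressed layer's outputs
# track the frozen fine-tuned layer's outputs on calibration inputs.
# Optimizer hyperparameters are the ones quoted in the experiment-setup row.
optimizer = torch.optim.Adam([alpha], lr=1e-4, betas=(0.9, 0.999), eps=1e-8)
for step in range(200):
    x = torch.randn(32, 256)                   # stand-in for a calibration batch
    with torch.no_grad():
        target = x @ w_fine.T                  # frozen fine-tuned reference
    student = x @ (w_base + alpha * sign).T    # base weight + binarized delta
    loss = F.mse_loss(student, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("distilled scale:", alpha.item(), "vs. init:", delta.abs().mean().item())
```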