Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models

Authors: Wanyun Cui, Qianle Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of CherryQ. CherryQ outperforms existing quantization approaches in terms of perplexity and downstream task performance.
Researcher Affiliation | Academia | Wanyun Cui*, Qianle Wang. Shanghai University of Finance and Economics; MoE Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics. cui.wanyun@sufe.edu.cn, wql20000111@stu.sufe.edu.cn
Pseudocode | Yes | Algorithm 1 CherryQ
Open Source Code | Yes | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have attached the codes in the submission.
Open Datasets | Yes | For the quantization of the base LLMs, we follow [9] to use C4 [20] as the training data. We selected the first four partitions of C4 and chose data with a length of 2048 tokens, resulting in a total of 50k samples of 2048 tokens. For the chat LLMs, since Vicuna-1.5 [5] is obtained by supervised fine-tuning based on ShareGPT [5], we also use the ShareGPT dataset for training.
Dataset Splits | Yes | We selected the first four partitions of C4 and chose data with a length of 2048 tokens, resulting in a total of 50k samples of 2048 tokens.
Hardware Specification | Yes | For all LLM scales (7B, 13B), and both base models and chat models (LLaMA2, Vicuna-v1.5), we train the models on a single node with 8 x A100 80GiB GPUs.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers, such as Python or PyTorch versions, that are needed to replicate the experiment.
Experiment Setup | Yes | We use a total batch size of 128, a learning rate of 2e-5, a weight decay of 0.0, a cosine scheduler with 5% warm-up steps. The final learning rate is 25% of the peak learning rate for 2/3-bit LLMs, 10% for 4-bit LLMs. We train 1 epoch on base models, 2 epochs on chat models.
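
The Pseudocode row refers to Algorithm 1 (CherryQ). Its central idea, keeping a small set of high-impact "cherry" parameters in high precision while quantizing the remainder, can be illustrated with a minimal sketch. The impact proxy, the 1% cherry ratio, and the per-tensor symmetric quantizer below are assumptions chosen for illustration, not the authors' Algorithm 1.

```python
# Minimal sketch of mixed-precision quantization with "cherry" parameters
# (NOT the paper's Algorithm 1): the top fraction of parameters ranked by an
# impact score keep full precision; everything else is uniformly quantized.
import torch

def quantize_with_cherries(weight: torch.Tensor,
                           impact: torch.Tensor,
                           bits: int = 4,
                           cherry_ratio: float = 0.01):
    """Symmetric per-tensor fake quantization, except the top `cherry_ratio`
    fraction of parameters (ranked by `impact`), which keep their FP values."""
    n_cherry = max(1, int(cherry_ratio * weight.numel()))
    cherry_idx = torch.topk(impact.flatten(), n_cherry).indices

    # Uniform symmetric quantization of the whole tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = weight.abs().max() / qmax
    dequant = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale

    # Restore the cherry parameters at full precision.
    flat = dequant.flatten().clone()
    flat[cherry_idx] = weight.flatten()[cherry_idx]
    return flat.view_as(weight), cherry_idx
```

In practice the impact score would be estimated on a small calibration set (for example, from squared gradients) and the selection done per weight matrix; the sketch only shows the selection and fake-quantization step.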
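
The Open Datasets and Dataset Splits rows describe selecting 50k sequences of 2048 tokens from the first four partitions of C4. A minimal sketch of such a selection is shown below, assuming the HuggingFace datasets and transformers libraries; the shard names, the LLaMA-2 tokenizer identifier, and the rule of keeping only documents with at least 2048 tokens are assumptions, since the paper's preprocessing script is not quoted here.

```python
# Minimal sketch of the C4 calibration-data selection described above:
# 50k sequences of 2048 tokens drawn from the first four C4 shards.
# Shard names, the tokenizer identifier, and the length filter are assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed tokenizer
SEQ_LEN, TARGET_SAMPLES = 2048, 50_000

# First four shards of the English C4 training split ("first four partitions").
shards = [f"en/c4-train.{i:05d}-of-01024.json.gz" for i in range(4)]
stream = load_dataset("allenai/c4", data_files={"train": shards},
                      split="train", streaming=True)

samples = []
for example in stream:
    ids = tokenizer(example["text"], truncation=False)["input_ids"]
    if len(ids) >= SEQ_LEN:              # keep documents long enough for a full window
        samples.append(ids[:SEQ_LEN])
    if len(samples) >= TARGET_SAMPLES:
        break
```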
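
The Experiment Setup row specifies a peak learning rate of 2e-5, 5% warm-up, a cosine schedule, and a final learning rate of 25% (2/3-bit) or 10% (4-bit) of the peak. A standard cosine schedule decays to zero, so the reported floor needs a small custom schedule; the sketch below shows one way to express it in PyTorch, with AdamW as an assumed optimizer choice (the quoted text does not name the optimizer).

```python
# Minimal sketch of the reported schedule: peak LR 2e-5, weight decay 0.0,
# 5% linear warm-up, cosine decay to a floor of 25% of the peak (2/3-bit)
# or 10% (4-bit). AdamW and the placeholder model are assumptions.
import math
import torch

def cosine_with_floor(optimizer, total_steps, warmup_frac=0.05, final_ratio=0.25):
    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)               # linear warm-up to the peak
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
        return final_ratio + (1.0 - final_ratio) * cosine    # decays from 1 to final_ratio

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(16, 16)  # placeholder for the quantization-aware model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
scheduler = cosine_with_floor(optimizer, total_steps=50_000 // 128)  # ~1 epoch at batch size 128
```

With the 50k training samples and a total batch size of 128, one epoch corresponds to roughly 390 optimizer steps, which is what the `total_steps` argument encodes.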