Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
Authors: Wanyun Cui, Qianle Wang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of CherryQ. CherryQ outperforms existing quantization approaches in terms of perplexity and downstream task performance. |
| Researcher Affiliation | Academia | Wanyun Cui*, Qianle Wang; Shanghai University of Finance and Economics; MoE Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics; cui.wanyun@sufe.edu.cn, wql20000111@stu.sufe.edu.cn |
| Pseudocode | Yes | Algorithm 1: CherryQ (a hedged sketch of the cherry-parameter quantization idea follows the table) |
| Open Source Code | Yes | Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We have attached the codes in the submission. |
| Open Datasets | Yes | For the quantization of the base LLMs, we follow [9] to use C4 [20] as the training data. We selected the first four partitions of C4 and chose data with a length of 2048 tokens, resulting in a total of 50k samples of 2048 tokens. For the chat LLMs, since Vicuna-1.5 [5] is obtained by supervised fine-tuning based on ShareGPT [5], we also use the ShareGPT dataset for training. (A data-preparation sketch follows the table.) |
| Dataset Splits | Yes | We selected the first four partitions of C4 and chose data with a length of 2048 tokens, resulting in a total of 50k samples of 2048 tokens. |
| Hardware Specification | Yes | For all LLM scales (7B, 13B), and both base models and chat models (LLaMA2, Vicuna-v1.5), we train the models on a single node with 8 x A100 80GiB GPUs. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers, such as Python or PyTorch versions, needed to replicate the experiments. |
| Experiment Setup | Yes | We use a total batch size of 128, a learning rate of 2e-5, a weight decay of 0.0, and a cosine scheduler with 5% warm-up steps. The final learning rate is 25% of the peak learning rate for 2/3-bit LLMs and 10% for 4-bit LLMs. We train for 1 epoch on base models and 2 epochs on chat models. (A learning-rate schedule sketch follows the table.) |
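
The Pseudocode row refers to Algorithm 1 (CherryQ). The snippet below is a minimal, hypothetical sketch of the underlying idea reported by the paper: keep a small fraction of high-impact "cherry" parameters in full precision and quantize the rest of each weight matrix. The impact proxy, the 1% cherry fraction, and the symmetric uniform quantizer are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: keep a small fraction of high-impact ("cherry") parameters in
# full precision and quantize the remaining parameters to a low bit-width.
# The impact scores and the 1% cherry fraction are assumptions for illustration;
# they are not necessarily the paper's exact criterion.
import torch

def quantize_with_cherries(weight: torch.Tensor,
                           impact: torch.Tensor,
                           bits: int = 4,
                           cherry_frac: float = 0.01):
    """Return (mixed-precision weight, cherry mask). Cherries keep full precision."""
    k = max(1, int(cherry_frac * weight.numel()))
    # Select the top-k most impactful parameters as cherries.
    threshold = impact.flatten().topk(k).values.min()
    cherry_mask = impact >= threshold

    # Simple symmetric uniform quantization for the non-cherry parameters.
    qmax = 2 ** (bits - 1) - 1
    scale = weight[~cherry_mask].abs().max() / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax)
    dequant = q * scale

    # Cherries retain their original high-precision values.
    out = torch.where(cherry_mask, weight, dequant)
    return out, cherry_mask

# Usage: impact could be, e.g., an accumulated squared-gradient estimate per weight.
w = torch.randn(4096, 4096)
impact = torch.rand_like(w)          # placeholder impact scores
w_q, mask = quantize_with_cherries(w, impact, bits=4)
```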
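The dataset rows state that the base-model training data comes from the first four partitions of C4, packed into 50k samples of 2048 tokens. The sketch below shows one plausible way to reproduce that packing with the Hugging Face `datasets` and `transformers` libraries; the streaming approach, the tokenizer checkpoint, and the packing strategy are assumptions rather than the authors' script.

```python
# Hedged sketch of the C4 data preparation described above: stream C4, tokenize,
# and pack text into fixed 2048-token samples until 50k samples are collected.
# The paper only states that the first four C4 partitions were used; the details
# below (streaming, tokenizer choice, packing) are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

SEQ_LEN = 2048
NUM_SAMPLES = 50_000

# Assumed tokenizer checkpoint; any LLaMA-2-compatible tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

samples, buffer = [], []
for example in stream:
    buffer.extend(tokenizer(example["text"])["input_ids"])
    # Emit full 2048-token chunks from the running token buffer.
    while len(buffer) >= SEQ_LEN:
        samples.append(buffer[:SEQ_LEN])
        buffer = buffer[SEQ_LEN:]
    if len(samples) >= NUM_SAMPLES:
        break

print(f"collected {len(samples)} samples of {SEQ_LEN} tokens")
```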
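The Experiment Setup row describes a cosine schedule with 5% warm-up whose final learning rate is a fixed fraction of the peak (25% for 2/3-bit, 10% for 4-bit). The helper below is a hedged sketch of such a schedule; the function name and the linear warm-up shape are assumptions.

```python
# Hedged sketch of the reported learning-rate schedule: linear warm-up over the
# first 5% of steps, then cosine decay from the peak LR down to a floor
# (25% of peak for 2/3-bit runs, 10% for 4-bit runs). Names are illustrative.
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-5,
               warmup_frac: float = 0.05, final_frac: float = 0.25) -> float:
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps       # linear warm-up
    # Cosine decay from peak_lr to final_frac * peak_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)

# Example: a 4-bit run uses a floor at 10% of the peak learning rate.
total = 10_000
print(lr_at_step(0, total, final_frac=0.10),
      lr_at_step(total // 2, total, final_frac=0.10),
      lr_at_step(total - 1, total, final_frac=0.10))
```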