Token-Scaled Logit Distillation for Ternary Weight Generative Language Models
Authors: Minsoo Kim, Sihwa Lee, Janghwan Lee, Sukjin Hong, Du-Seong Chang, Wonyong Sung, Jungwook Choi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate TSLD across a range of GLMs originating from GPT-2 [2], OPT [4] and LLaMA [5] of various sizes, including 7 billion models for the first time. The results show that TSLD achieves comparable, if not superior, performance in language modeling on ternary and 4-bit inference. When TSLD is applied to reasoning tasks, it surprisingly prevents overfitting to achieve task accuracy that is at least on par, if not better. These remarkable outcomes underline the potential of our proposed TSLD method in facilitating the deployment of ultra-low precision GLMs. |
| Researcher Affiliation | Collaboration | Minsoo Kim¹, Sihwa Lee¹, Janghwan Lee¹, Sukjin Hong¹·², Du-Seong Chang², Wonyong Sung³, Jungwook Choi¹ (¹Hanyang University, Seoul, Republic of Korea; ²KT, Seoul, Republic of Korea; ³Seoul National University, Seoul, Republic of Korea) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/aiha-lab/TSLD |
| Open Datasets | Yes | We evaluate our proposed method for language modeling (PTB [34]), commonsense QA tasks (PIQA [35], OpenbookQA [36], ARC_easy [37], ARC_challenge [37]) and arithmetic reasoning based text-generation task (GSM8K [38]). Additionally, our assessment extends to Natural Language Understanding (NLU) task (GLUE [39]). (See the dataset-loading sketch after the table.) |
| Dataset Splits | No | The paper describes data preprocessing and fine-tuning strategies but does not provide specific percentages, counts, or explicit standard references for training, validation, and test splits needed for reproduction. |
| Hardware Specification | Yes | Experiments are conducted on an A100-40GB GPU. Our kernel eliminates the need for weight unpacking during the model forward pass, resulting in a speedup shown in Table 4. We tested our kernel mainly on models larger than 6.7B, where weight load overhead is notably high. The reported times are the average execution time for 10,000 kernel runs on a single A100-80GB GPU. |
| Software Dependencies | No | Our experimental implementation utilizes the Huggingface language modeling code base. For the FP32 baseline, we used PyTorch's nn.Linear. The paper does not provide specific version numbers for Huggingface, PyTorch, or other key software dependencies. |
| Experiment Setup | Yes | Learning Rate (FP) 1e-4, 5e-5; Epoch (FP) 3, 1; Learning Rate (QAT) 1e-4, 7e-5; Epoch (QAT) 90, 60, 30, 10, 5. All experiments consistently set a batch size of 4, and sequence length of 512 in language modeling fine-tuning. Following the dynamic scaling method of QuantGPT, we determine the clipping value α for quantization by multiplying the average weight magnitude ‖w‖₁/n with a learnable scale factor γ, where ‖·‖₁ denotes the ℓ1 norm: α = γ‖w‖₁/n. In this case, the initial value for γ is set to 1, and the learning rate for γ is 0.0002. (A hedged quantizer sketch follows this table.) |
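
The clipping rule quoted in the Experiment Setup row is compact enough to sketch in code. The snippet below is a minimal illustration rather than the authors' implementation: the clipping value α = γ‖w‖₁/n and the learnable γ (initialized to 1.0, trained with learning rate 2e-4) follow the quoted setup, while the symmetric rounding onto {−α, 0, +α} and the straight-through estimator are assumptions about how such a quantizer is typically realized during QAT.

```python
import torch
import torch.nn as nn


class TernaryWeightQuantizer(nn.Module):
    """Minimal QAT-style ternary quantizer sketch.

    The clipping value alpha = gamma * ||w||_1 / n and the learnable gamma
    (initialized to 1.0) follow the setup quoted above; the rounding scheme
    and the straight-through estimator are illustrative assumptions, not
    the paper's exact formulation.
    """

    def __init__(self) -> None:
        super().__init__()
        # Learnable scale factor gamma (lr 2e-4 in the quoted setup).
        self.gamma = nn.Parameter(torch.tensor(1.0))

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        n = w.numel()
        # Dynamic clipping value: gamma times the average weight magnitude.
        alpha = self.gamma * w.abs().sum() / n
        # Normalize, clip to [-1, 1], and round onto the ternary grid {-1, 0, +1}.
        w_scaled = torch.clamp(w / alpha, -1.0, 1.0)
        # Straight-through estimator on the rounding step only, so gradients
        # still flow to both the weights and gamma.
        w_int = w_scaled + (torch.round(w_scaled) - w_scaled).detach()
        return alpha * w_int


# Usage: quantize a linear layer's weight on the fly during fine-tuning.
quantizer = TernaryWeightQuantizer()
layer = nn.Linear(512, 512)
x = torch.randn(4, 512)
y = torch.nn.functional.linear(x, quantizer(layer.weight), layer.bias)
```

Applying the straight-through estimator only to the rounding step keeps α differentiable, which is what allows γ to be updated with its own learning rate as described in the setup.

For the Open Datasets and Dataset Splits rows, the benchmarks named in the paper are publicly hosted on the Huggingface Hub and ship with their standard splits. The sketch below is illustrative only; the hub IDs and configuration names are assumptions based on common usage, not paths taken from the TSLD repository.

```python
from datasets import load_dataset  # pip install datasets

# Language modeling
ptb = load_dataset("ptb_text_only")

# Commonsense QA
piqa = load_dataset("piqa")
openbookqa = load_dataset("openbookqa", "main")
arc_easy = load_dataset("ai2_arc", "ARC-Easy")
arc_challenge = load_dataset("ai2_arc", "ARC-Challenge")

# Arithmetic reasoning
gsm8k = load_dataset("gsm8k", "main")

# One GLUE task as an example (the paper evaluates the GLUE benchmark)
sst2 = load_dataset("glue", "sst2")

# Each dataset ships with its standard splits, e.g.:
print(gsm8k)      # DatasetDict with 'train' and 'test' splits
print(arc_easy)   # DatasetDict with 'train', 'validation' and 'test' splits
```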
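
Because the paper does not state its train/validation/test percentages, relying on these default Hub splits is the most reproducible fallback, but it remains an assumption rather than a confirmed match to the authors' preprocessing.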