DistiLLM: Towards Streamlined Distillation for Large Language Models
Authors: Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DISTILLM in building high-performing student models while achieving up to 4.3× speedup compared to recent KD methods. |
| Researcher Affiliation | Collaboration | KAIST AI, Seoul, Republic of Korea; Microsoft, Redmond, Washington, USA. Correspondence to: Se-Young Yun <yunseyoung@kaist.ac.kr>. |
| Pseudocode | Yes | Algorithm 1 Training pipeline of DISTILLM |
| Open Source Code | Yes | https://github.com/jongwooko/distillm |
| Open Datasets | Yes | "We first construct the training data from databricks-dolly-15k (Conover et al., 2023)..."; "We also add a language modeling (Radford et al., 2018) loss to the Open Web Text (Gokaslan et al., 2019) corpus for all experiments."; "SAMSum (Gliwa et al., 2019) and IWSLT 2017 (Cettolo et al., 2017)." |
| Dataset Splits | Yes | We randomly select 14K samples for training and leave 500 samples each for validation and testing (a split sketch follows the table). |
| Hardware Specification | Yes | For training the teacher and student models, we used four A100 40GB GPUs for the instruction-following task and four RTX 3090 GPUs for the text summarization and machine translation tasks. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were explicitly mentioned. |
| Experiment Setup | Yes | For models within 1B parameters, we search over learning rates in {5e-4, 1e-4, 5e-5} and batch sizes in {8, 16, 32}, up to the maximum batch size that fits on an A100 40GB GPU, and train these models for 20 epochs. For models with more than 1B parameters, we search over learning rates in {5e-5, 1e-5, 5e-6}, use a batch size of 8, and train these models for 10 epochs (see the search-grid sketch after the table). |
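
The 14K/500/500 split of databricks-dolly-15k reported in the Dataset Splits row can be reproduced roughly as follows. This is a minimal sketch assuming the Hugging Face `datasets` library; the shuffling seed and loading code are assumptions, not values stated in the paper.

```python
# Minimal sketch of the reported 14K/500/500 split of databricks-dolly-15k.
# Assumes the Hugging Face `datasets` library; the seed is a placeholder,
# not a value taken from the paper or the authors' repository.
from datasets import load_dataset

raw = load_dataset("databricks/databricks-dolly-15k", split="train")
shuffled = raw.shuffle(seed=42)  # seed is an assumption

train_set = shuffled.select(range(14_000))          # 14K samples for training
valid_set = shuffled.select(range(14_000, 14_500))  # 500 samples for validation
test_set  = shuffled.select(range(14_500, 15_000))  # 500 samples for testing
```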
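
The hyperparameter search described in the Experiment Setup row amounts to a small grid over learning rate and batch size, split by model scale. The sketch below only enumerates that grid for illustration; it is not the authors' training code, which is available in the linked repository.

```python
# Hedged sketch of the reported hyperparameter search grid (illustration only).
from itertools import product

SEARCH_SPACE = {
    # Models with at most 1B parameters: 20 epochs.
    "small": {"lr": [5e-4, 1e-4, 5e-5], "batch_size": [8, 16, 32], "epochs": 20},
    # Models with more than 1B parameters: batch size fixed to 8, 10 epochs.
    "large": {"lr": [5e-5, 1e-5, 5e-6], "batch_size": [8], "epochs": 10},
}

def candidate_configs(model_scale: str):
    """Yield one config dict per (learning rate, batch size) combination."""
    space = SEARCH_SPACE[model_scale]
    for lr, bs in product(space["lr"], space["batch_size"]):
        yield {"lr": lr, "batch_size": bs, "epochs": space["epochs"]}

if __name__ == "__main__":
    for cfg in candidate_configs("small"):
        print(cfg)
```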