DistiLLM: Towards Streamlined Distillation for Large Language Models

Authors: Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DISTILLM in building high-performing student models while achieving up to 4.3× speedup compared to recent KD methods."
Researcher Affiliation | Collaboration | "KAIST AI, Seoul, Republic of Korea; Microsoft, Redmond, Washington, USA. Correspondence to: Se-Young Yun <yunseyoung@kaist.ac.kr>."
Pseudocode | Yes | "Algorithm 1: Training pipeline of DISTILLM"
Open Source Code | Yes | https://github.com/jongwooko/distillm
Open Datasets | Yes | "We first construct the training data from databricks-dolly-15k (Conover et al., 2023)..."; "We also add a language modeling (Radford et al., 2018) loss to the OpenWebText (Gokaslan et al., 2019) corpus for all experiments."; "SAMSum (Gliwa et al., 2019) and IWSLT 2017 (Cettolo et al., 2017)."
Dataset Splits | Yes | "wherein we randomly select 14K samples for training and equally leave 500 samples for validation and testing, respectively." (see the split sketch below)
Hardware Specification | Yes | "For training the teacher and student models, we used four A100 40GB GPUs for the instruction-following task and four RTX 3090 GPUs for the text summarization and machine translation tasks."
Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were explicitly mentioned.
Experiment Setup | Yes | "For models within 1B parameters, we search for the learning rates in {5e-4, 1e-4, 5e-5}, the batch sizes in {8, 16, 32} within the possible maximum batch size for A100 40GB GPUs, and train these models for 20 epochs. For models that have more than 1B parameters, we search for the learning rate in {5e-5, 1e-5, 5e-6}, the batch sizes of 8, and train these models for 10 epochs." (see the hyperparameter sweep sketch below)
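For concreteness, here is a minimal sketch of the 14K/500/500 split reported under Dataset Splits. It assumes a plain seeded random shuffle of the databricks-dolly-15k JSONL file; the file name, seed, and the `split_dolly` helper are illustrative and are not taken from the DistiLLM repository.

```python
# Illustrative sketch of the reported 14K / 500 / 500 split of databricks-dolly-15k.
# Assumes a simple seeded random shuffle; the exact procedure in the DistiLLM
# repository may differ.
import json
import random


def split_dolly(path="databricks-dolly-15k.jsonl", seed=42):
    # Load the instruction-following samples (one JSON object per line).
    with open(path) as f:
        samples = [json.loads(line) for line in f]
    # Shuffle deterministically, then take 14K train / 500 validation / 500 test.
    random.Random(seed).shuffle(samples)
    train = samples[:14_000]
    valid = samples[14_000:14_500]
    test = samples[14_500:15_000]
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_dolly()
    print(len(train), len(valid), len(test))  # expected: 14000 500 500
```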
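The Experiment Setup row describes two hyperparameter grids, one for models up to 1B parameters and one for larger models. The sketch below only encodes those grid values and epoch counts; `train_and_eval` is a hypothetical callback rather than a function from the DistiLLM codebase, and selecting the configuration with the highest validation score is an assumption.

```python
# Sketch of the hyperparameter grids quoted under "Experiment Setup".
# Only the grid values and epoch counts come from the paper; the search loop
# and the `train_and_eval` callback are illustrative.
from itertools import product


def search_space(num_params: int):
    """Return (learning_rates, batch_sizes, epochs) for a given model size."""
    if num_params <= 1_000_000_000:        # models within 1B parameters
        return [5e-4, 1e-4, 5e-5], [8, 16, 32], 20
    return [5e-5, 1e-5, 5e-6], [8], 10     # models with more than 1B parameters


def grid_search(num_params: int, train_and_eval):
    """Try every (learning rate, batch size) pair and keep the best-scoring config."""
    lrs, batch_sizes, epochs = search_space(num_params)
    best_score, best_config = None, None
    for lr, bs in product(lrs, batch_sizes):
        score = train_and_eval(lr=lr, batch_size=bs, epochs=epochs)
        if best_score is None or score > best_score:
            best_score = score
            best_config = {"lr": lr, "batch_size": bs, "epochs": epochs}
    return best_score, best_config


# Example usage with a dummy objective (replace with the actual distillation run):
# score, config = grid_search(770_000_000, lambda lr, batch_size, epochs: -lr)
```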