DistiLLM: Towards Streamlined Distillation for Large Language Models

Authors: Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments, including instruction-following tasks, demonstrate the effectiveness of DISTILLM in building high-performing student models while achieving up to 4.3× speedup compared to recent KD methods."
Researcher Affiliation | Collaboration | "KAIST AI, Seoul, Republic of Korea; Microsoft, Redmond, Washington, USA. Correspondence to: Se-Young Yun <yunseyoung@kaist.ac.kr>."
Pseudocode | Yes | "Algorithm 1: Training pipeline of DISTILLM"
Open Source Code | Yes | https://github.com/jongwooko/distillm
Open Datasets | Yes | "We first construct the training data from databricks-dolly-15k (Conover et al., 2023)..."; "We also add a language modeling (Radford et al., 2018) loss to the OpenWebText (Gokaslan et al., 2019) corpus for all experiments."; "SAMSum (Gliwa et al., 2019) and IWSLT 2017 (Cettolo et al., 2017)."
Dataset Splits | Yes | "wherein we randomly select 14K samples for training and equally leave 500 samples for validation and testing, respectively." (see the split sketch below)
Hardware Specification | Yes | "For training the teacher and student models, we used four A100 40GB GPUs for the instruction-following task and four RTX 3090 GPUs for the text summarization and machine translation tasks."
Software Dependencies | No | No specific software dependencies with version numbers (e.g., PyTorch 1.9, Python 3.8) were explicitly mentioned.
Experiment Setup | Yes | "For models within 1B parameters, we search for the learning rates in {5e-4, 1e-4, 5e-5}, the batch sizes in {8, 16, 32} within the possible maximum batch size for A100 40GB GPUs, and train these models for 20 epochs. For models that have more than 1B parameters, we search for the learning rate in {5e-5, 1e-5, 5e-6}, the batch sizes of 8, and train these models for 10 epochs." (see the hyperparameter sweep sketch below)
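For concreteness, here is a minimal sketch of the 14K/500/500 split reported under Dataset Splits. It assumes a plain seeded random shuffle of the databricks-dolly-15k JSONL file; the file name, seed, and the `split_dolly` helper are illustrative and are not taken from the DistiLLM repository.

```python
# Illustrative sketch of the reported 14K / 500 / 500 split of databricks-dolly-15k.
# Assumes a simple seeded random shuffle; the exact procedure in the DistiLLM
# repository may differ.
import json
import random


def split_dolly(path="databricks-dolly-15k.jsonl", seed=42):
    # Load the instruction-following samples (one JSON object per line).
    with open(path) as f:
        samples = [json.loads(line) for line in f]
    # Shuffle deterministically, then take 14K train / 500 validation / 500 test.
    random.Random(seed).shuffle(samples)
    train = samples[:14_000]
    valid = samples[14_000:14_500]
    test = samples[14_500:15_000]
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = split_dolly()
    print(len(train), len(valid), len(test))  # expected: 14000 500 500
```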
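The Experiment Setup row describes two hyperparameter grids, one for models up to 1B parameters and one for larger models. The sketch below only encodes those grid values and epoch counts; `train_and_eval` is a hypothetical callback rather than a function from the DistiLLM codebase, and selecting the configuration with the highest validation score is an assumption.

```python
# Sketch of the hyperparameter grids quoted under "Experiment Setup".
# Only the grid values and epoch counts come from the paper; the search loop
# and the `train_and_eval` callback are illustrative.
from itertools import product


def search_space(num_params: int):
    """Return (learning_rates, batch_sizes, epochs) for a given model size."""
    if num_params <= 1_000_000_000:        # models within 1B parameters
        return [5e-4, 1e-4, 5e-5], [8, 16, 32], 20
    return [5e-5, 1e-5, 5e-6], [8], 10     # models with more than 1B parameters


def grid_search(num_params: int, train_and_eval):
    """Try every (learning rate, batch size) pair and keep the best-scoring config."""
    lrs, batch_sizes, epochs = search_space(num_params)
    best_score, best_config = None, None
    for lr, bs in product(lrs, batch_sizes):
        score = train_and_eval(lr=lr, batch_size=bs, epochs=epochs)
        if best_score is None or score > best_score:
            best_score = score
            best_config = {"lr": lr, "batch_size": bs, "epochs": epochs}
    return best_score, best_config


# Example usage with a dummy objective (replace with the actual distillation run):
# score, config = grid_search(770_000_000, lambda lr, batch_size, epochs: -lr)
```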