MiniLLM: Knowledge Distillation of Large Language Models

Authors: Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in the instruction-following setting show that MINILLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines.
Researcher Affiliation | Collaboration | Yuxian Gu (1,2), Li Dong (2), Furu Wei (2), Minlie Huang (1); 1: The CoAI Group, Tsinghua University; 2: Microsoft Research
Pseudocode | Yes | Algorithm 1 MINILLM: Knowledge Distillation of LLMs
Open Source Code | Yes | Our code, data, and model checkpoints can be found at https://github.com/microsoft/LMOps/tree/main/minillm.
Open Datasets | Yes | We construct the training data from databricks-dolly-15K (https://github.com/databrickslabs/dolly/tree/master), consisting of 15K human-written instruction-response pairs.
Dataset Splits | Yes | Then, we randomly split 0.5K and 1K samples for validation and testing, respectively, leaving about 12.5K examples for training. (A split sketch follows the table.)
Hardware Specification | Yes | Our experiments are based on NVIDIA V100 32GB GPUs.
Software Dependencies | No | The paper does not specify version numbers for key software components such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Phase 2: We continuously train the model from Phase 1 as described in Algorithm B, using a learning rate of 5e-6 and a mini-batch size of 64 in all cases. The clipping rate ϵ is set to 0.2, and the max length of the model is 512. We use temperature = 1 when sampling from qθ. (A config sketch follows the table.)
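
The Experiment Setup row lists concrete Phase 2 hyperparameters. Below is a minimal sketch that collects them into a single configuration object; the class and field names (Phase2Config, clip_eps, and so on) are illustrative assumptions and are not taken from the released MiniLLM code, only the numeric values come from the paper.

    from dataclasses import dataclass

    @dataclass
    class Phase2Config:
        """Illustrative container for the Phase 2 settings quoted above.
        Field names are assumptions; only the values come from the paper."""
        learning_rate: float = 5e-6  # learning rate for continued training from Phase 1
        batch_size: int = 64         # mini-batch size, used in all cases
        clip_eps: float = 0.2        # clipping rate (epsilon) in the MiniLLM update
        max_length: int = 512        # maximum model sequence length
        temperature: float = 1.0     # sampling temperature when drawing from the student q_theta

    print(Phase2Config())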
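
The Dataset Splits row reports a random 0.5K validation / 1K test / roughly 12.5K training split of databricks-dolly-15k. The following is a minimal sketch of such a split; the input file name and the random seed are assumptions, not details given in the paper.

    import json
    import random

    # Local copy of databricks-dolly-15k in JSONL form; the file name is an assumption.
    with open("databricks-dolly-15k.jsonl") as f:
        data = [json.loads(line) for line in f]

    random.seed(0)  # the paper only says "randomly split"; the seed value is an assumption
    random.shuffle(data)

    # 0.5K validation, 1K test, and the remaining ~12.5K for training, as reported in the paper.
    valid, test, train = data[:500], data[500:1500], data[1500:]

    for name, split in (("valid", valid), ("test", test), ("train", train)):
        with open(f"dolly_{name}.jsonl", "w") as out:
            for example in split:
                out.write(json.dumps(example) + "\n")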