MiniLLM: Knowledge Distillation of Large Language Models
Authors: Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in the instruction-following setting show that MINILLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. |
| Researcher Affiliation | Collaboration | Yuxian Gu1,2, Li Dong2, Furu Wei2, Minlie Huang1 — 1The CoAI Group, Tsinghua University; 2Microsoft Research |
| Pseudocode | Yes | Algorithm 1 MINILLM: Knowledge Distillation of LLMs |
| Open Source Code | Yes | Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm. |
| Open Datasets | Yes | We construct the training data from databricks-dolly-15K (https://github.com/databrickslabs/dolly/tree/master), consisting of 15K human-written instruction-response pairs. |
| Dataset Splits | Yes | Then, we randomly split 0.5K and 1K samples for validation and testing, respectively, leaving about 12.5K examples for training. |
| Hardware Specification | Yes | Our experiments are based on the NVIDIA V100 32G GPUs. |
| Software Dependencies | No | The paper does not specify version numbers for key software components such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Phase 2: We continuously train the model from Phase 1 as described in Algorithm B, using a learning rate of 5e-6 and a mini-batch size of 64 in all cases. The clipping rate ϵ is set to 0.2, and the max length of the model is 512. We use temperature = 1 when sampling from q_θ. |
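
The Pseudocode row above cites Algorithm 1, which trains the student policy to reduce the reverse KL to the teacher via sampled responses. Below is a minimal, hypothetical sketch of that idea in PyTorch: `student` and `teacher` are assumed to be Hugging Face-style causal LMs, prompt masking is omitted, and the variance-reduction and stabilization techniques of the actual algorithm (e.g. the clipping with ϵ = 0.2 quoted above and an added language-modeling loss) are left out, so this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample_response(student, prompt_ids, max_new_tokens=512, temperature=1.0):
    # Sample y ~ q_theta(. | x); temperature 1 as in the quoted setup.
    return student.generate(
        prompt_ids,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
    )


def sequence_log_prob(model, input_ids):
    # Sum of per-token log-probabilities of `input_ids` under `model`
    # (prompt tokens are not masked out here, for brevity).
    logits = model(input_ids).logits[:, :-1, :]
    log_probs = F.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum(dim=-1)


def reverse_kl_step(student, teacher, prompt_ids, optimizer):
    # One REINFORCE-style update that decreases KL[q_theta || p] estimated
    # on a single sampled response.
    full_ids = sample_response(student, prompt_ids)
    log_q = sequence_log_prob(student, full_ids)
    with torch.no_grad():
        log_p = sequence_log_prob(teacher, full_ids)
    # Sequence-level reward: how much more likely the teacher finds the sample.
    reward = (log_p - log_q).detach()
    loss = -(reward * log_q).mean()  # policy-gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
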
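For the Experiment Setup row, the quoted Phase 2 hyperparameters can be collected into a small config object, sketched below under stated assumptions: the optimizer choice (AdamW) and all field names are illustrative and not taken from the paper.

```python
from dataclasses import dataclass

import torch


@dataclass
class Phase2Config:
    learning_rate: float = 5e-6       # quoted learning rate
    mini_batch_size: int = 64         # quoted mini-batch size
    clip_epsilon: float = 0.2         # quoted clipping rate
    max_length: int = 512             # quoted max model length
    sampling_temperature: float = 1.0  # quoted sampling temperature


def build_optimizer(model: torch.nn.Module, cfg: Phase2Config) -> torch.optim.Optimizer:
    # AdamW is an assumption; the paper only states the learning rate.
    return torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)
```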