Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MiniLLM: Knowledge Distillation of Large Language Models
Authors: Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in the instruction-following setting show that MINILLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. |
| Researcher Affiliation | Collaboration | Yuxian Gu (1, 2), Li Dong (2), Furu Wei (2), Minlie Huang (1). (1) The CoAI Group, Tsinghua University; (2) Microsoft Research |
| Pseudocode | Yes | Algorithm 1 MINILLM: Knowledge Distillation of LLMs |
| Open Source Code | Yes | Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm. |
| Open Datasets | Yes | We construct the training data from databricks-dolly-15K (footnote 3) consisting of 15K human-written instruction-response pairs. (...) Footnote 3: https://github.com/databrickslabs/dolly/tree/master |
| Dataset Splits | Yes | Then, we randomly split 0.5K and 1K samples for validation and testing, respectively, leaving about 12.5K examples for training. |
| Hardware Specification | Yes | Our experiments are based on the NVIDIA V100 32G GPUs. |
| Software Dependencies | No | The paper does not specify version numbers for key software components such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Phase 2: We continuously train the model from Phase 1 as described in Algorithm B using a learning rate 5e-6, a mini-batch size 64 in all cases. The clipping rate ϵ is set to 0.2, and the max length of the model is 512. We use temperature = 1 when sampling from qθ. |
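
The Dataset Splits and Experiment Setup rows above can be read as a small configuration plus a random hold-out split. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Phase2Config` and `split_examples` and all field names are hypothetical, and the random seed is an assumption because the quoted text does not give one. Only the numeric values (learning rate 5e-6, mini-batch size 64, clipping rate 0.2, max length 512, temperature 1, and 0.5K/1K held-out samples) come from the table.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase2Config:
    """Phase 2 values quoted in the Experiment Setup row (field names are hypothetical)."""
    learning_rate: float = 5e-6
    mini_batch_size: int = 64
    clip_epsilon: float = 0.2          # clipping rate (epsilon)
    max_length: int = 512              # max length of the model
    sampling_temperature: float = 1.0  # temperature when sampling from the student


def split_examples(examples, n_valid=500, n_test=1000, seed=0):
    """Randomly hold out validation and test sets; the remainder is training data.

    The seed is an assumption: the quoted text does not specify one.
    """
    if len(examples) < n_valid + n_test:
        raise ValueError("not enough examples to split")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, no global state
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test
```

With roughly 14K examples available before splitting (0.5K + 1K + about 12.5K, per the Dataset Splits row), `split_examples` leaves about 12.5K examples for training.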