Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Authors: Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments show that DISTILLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types. |
| Researcher Affiliation | Collaboration | Jongwoo Ko¹, Tianyi Chen², Sungnyun Kim¹, Tianyu Ding², Luming Liang², Ilya Zharkov², Se-Young Yun¹ (¹KAIST AI, ²Microsoft). Work done as a research intern at Microsoft. https://github.com/jongwooko/distillm-2 |
| Pseudocode | Yes | Algorithm 1 Training pipeline of DISTILLM-2 |
| Open Source Code | Yes | https://github.com/jongwooko/distillm-2 |
| Open Datasets | Yes | We apply DISTILLM-2 on instruction-following, math reasoning, and code generation datasets. We provide detailed descriptions of the datasets used. UltraChat-200k (instruction-following; Tunstall et al. 2023): This is a heavily filtered version of UltraChat (Ding et al., 2023), originally used to train Zephyr-7B-β (Tunstall et al., 2023). It is obtained from the original version, which consists of 1.4M dialogues generated by ChatGPT and spans a wide range of topics, by removing the dialogues that contain grammatical errors or where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions". AlpacaEval (instruction-following; Dubois et al. 2024): This dataset is a slight modification (and simplification) of the AlpacaFarm evaluation set. |
| Dataset Splits | Yes | We first construct the training datasets by randomly sampling 50k prompts from UltraChat-200k (Ding et al., 2023) and use the corresponding teacher and student to generate the responses. We conduct experiments on two standard mathematical reasoning benchmarks: GSM8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). For teacher and student pairs, we select Qwen2-Math-7B-Inst and Qwen2.5-Math-7B-Inst as teacher models and Qwen2-Math-1.5B and Qwen2.5-Math-1.5B as student models, respectively. The student models are trained using 50k randomly selected samples from the MetaMathQA (Yu et al., 2024a) dataset. |
| Hardware Specification | Yes | For all experiments, we utilize LoRA (low-rank adaptation; Hu et al. 2022), which is one of the most popular parameter-efficient fine-tuning techniques, for training efficiency. For all models, we use the maximum batch size that fits on 4 NVIDIA A100 80GB GPUs, while matching an effective batch size of 128 through the combination of batch size and gradient accumulation. For all experiments in Section 4, we first train the student models on training datasets with ground-truth responses using SFT, and then conduct KD for LLMs. |
| Software Dependencies | No | The paper mentions using the 'trl framework' for implementation, and 'GPT-4o' or 'GPT-4o-mini' as judge models, but does not specify version numbers for these software components or any other key libraries for reproducibility. |
| Experiment Setup | Yes | Here, we describe the hyperparameters and implementation details for training with DISTILLM-2, as shown in Table 11 (hyperparameter values used in DISTILLM-2 experiments in Section 4). Fine-tuning method: LoRA (r = 16); target modules for LoRA: all linear layers in the self-attention and MLP layers of the Transformer network; learning rate: 5.0 × 10⁻⁵; effective batch size: 128; epochs: 3 (instruction-following), 2 (mathematical reasoning), 2 (code generation); initial α₀: 0.1, with no curriculum-based update in the 1st epoch; clipping value β₀: 0.5. |
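The reported training setup fixes an effective batch size of 128 across 4 GPUs, with gradient accumulation making up the difference from whatever per-device batch size fits in memory. A minimal sketch of that arithmetic is below; the per-device batch size used in the example is an illustrative assumption, since the paper only states that the maximum size fitting on 4 A100 80GB GPUs was used.

```python
def grad_accum_steps(effective_batch: int, num_gpus: int, per_device_batch: int) -> int:
    """Accumulation steps so that num_gpus * per_device_batch * steps == effective_batch."""
    micro_batch = num_gpus * per_device_batch
    if effective_batch % micro_batch != 0:
        raise ValueError("effective batch must be a multiple of num_gpus * per_device_batch")
    return effective_batch // micro_batch

# Hypothetical per-device batch of 8 (not stated in the paper):
steps = grad_accum_steps(effective_batch=128, num_gpus=4, per_device_batch=8)
print(steps)  # 128 / (4 * 8) = 4
```

In trainer frameworks such as trl (which the paper reports using, version unspecified), this value would correspond to the gradient-accumulation-steps setting paired with the per-device batch size.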