Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

Authors: Wenda Xu, Rujun Han, Zifeng Wang, Long Le, Dhruv Madeka, Lei Li, William Wang, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies. The code and data are released at https://github.com/google-research/google-research/tree/master/speculative_kd.
Researcher Affiliation | Collaboration | 1UC Santa Barbara, 2Google Cloud AI Research, 3Google DeepMind, 4CMU. Work done as a student researcher at Google Cloud AI Research. Correspondence to: Wenda Xu (EMAIL), Rujun Han (EMAIL), Rishabh Agarwal (EMAIL), Chen-Yu Lee (EMAIL)
Pseudocode | Yes | Algorithm 1: Speculative knowledge distillation
Open Source Code | Yes | The code and data are released at https://github.com/google-research/google-research/tree/master/speculative_kd.
Open Datasets | Yes | We utilize the Flores-200 (Team, 2022) Assamese-to-English translation dataset for low-resource translation... For dialogue summarization, we utilize the DialogSum dataset (Chen et al., 2021)... For arithmetic reasoning, we utilize the GSM8K dataset (Cobbe et al., 2021)... For math instruction following, we utilize the UltraInteract dataset (Yuan et al., 2024)... For evaluation, we employ the GSMplus (Li et al., 2024), MATH (Hendrycks et al., 2021), ASDiv (Miao et al., 2020), and SVAMP (Patel et al., 2021) test sets as held-out sets.
Dataset Splits | Yes | We utilize the Flores-200 development set (997 instances) as our training set. Additionally, we split the Flores-200 testing set (1012 instances) into a development set (500 instances) and a testing set (512 instances) for evaluation... We randomly sample 1K instances from the DialogSum training set to create our input-output pairs (x, y). For evaluation, we employ the dataset's development set (500 instances) and test set (1500 instances)... We split the GSM8K training set into a development set (473 instances) and a training set (7K instances). We randomly sample 1K instances from the training set to create our input-output pairs (x, y). For evaluation, we employ the development set and the GSM8K test set (1319 instances)... We randomly sample 11K instances from the UltraInteract training set, dividing them into a training set (10K instances) and a development set (1K instances).
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or cloud instance names) are provided in the paper for running the experiments. The paper discusses "LLMs" and "models" without specifying the underlying physical hardware.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python version, PyTorch version, CUDA version) are provided in the paper. The paper implies the use of standard machine learning frameworks but lacks detailed versioning for reproducibility.
Experiment Setup | Yes | For all fine-tuning processes, we use a learning rate of 1e-5, a warmup ratio of 0.1, and a dropout rate of 0.1. All SFT checkpoints are trained for three epochs, and we select the checkpoint with the lowest validation loss... For both GEMMA-2B-IT and QWEN2-0.5B-IT models, we set the learning rate to 1e-5. We disabled dropout during sampling... Maximum input and output lengths were set task-specifically: (256, 256, 1024, 1024) and (256, 512, 128, 1024) for translation, arithmetic reasoning, summarization, and math instruction, respectively. All baseline models used a batch size of 8 and a gradient accumulation step of 1. The student model is trained for 375, 375, 375, and 1225 training steps for the summarization, GSM, math, and translation tasks, respectively... We ultimately settled on temperature=0.5 and top-p=0.5... we set temperature to 0.2 for all tasks and top-p to 0.5 for summarization and 1 for other tasks.
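The "Pseudocode" row above refers to the paper's Algorithm 1 (speculative knowledge distillation), which interleaves student proposals with teacher verification: the student proposes each next token, and proposals that fall outside the teacher's top-K tokens are replaced by a sample from the teacher. The toy sketch below illustrates that interleaved sampling step only; it uses random categorical distributions in place of real model logits, and restricting the teacher's resampling to its top-K is a simplification here, not a claim about the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 10   # toy vocabulary size (real models use the full tokenizer vocab)
TOP_K = 3    # teacher acceptance threshold, analogous to the paper's top-K check

def toy_dist(rng):
    """Random categorical distribution standing in for next-token probabilities."""
    p = rng.random(VOCAB)
    return p / p.sum()

def skd_step(student_p, teacher_p, rng, k=TOP_K):
    """One interleaved sampling step: student proposes, teacher verifies.

    If the student's proposal is outside the teacher's top-k tokens, it is
    replaced by a sample drawn from the teacher (restricted to its top-k
    here for simplicity)."""
    proposal = rng.choice(VOCAB, p=student_p)
    top_k = np.argsort(teacher_p)[-k:]
    if proposal in top_k:
        return int(proposal)
    renorm = teacher_p[top_k] / teacher_p[top_k].sum()
    return int(rng.choice(top_k, p=renorm))

# Generate a short toy sequence token by token.
tokens = [skd_step(toy_dist(rng), toy_dist(rng), rng) for _ in range(20)]
print(tokens)
```

Every emitted token is, by construction, one the teacher assigns high probability to, which is the property that lets the student train on sequences closer to the teacher's own distribution.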
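The Experiment Setup row packs the reported hyperparameters into prose; as a reading aid, the sketch below collects them into plain Python dictionaries. The key names are illustrative (the paper does not specify a config schema); the values are taken directly from the quoted excerpt.

```python
# Hedged summary of the reported fine-tuning setup; key names are assumptions.
train_config = {
    "learning_rate": 1e-5,
    "warmup_ratio": 0.1,
    "dropout": 0.1,          # disabled during sampling, per the excerpt
    "sft_epochs": 3,         # checkpoint with lowest validation loss selected
    "batch_size": 8,
    "grad_accum_steps": 1,
    # Settled-on sampling values; the excerpt also reports temperature 0.2
    # with task-dependent top-p in another setting.
    "sampling": {"temperature": 0.5, "top_p": 0.5},
}

# (max input length, max output length) per task, per the excerpt.
max_lengths = {
    "translation": (256, 256),
    "arithmetic_reasoning": (256, 512),
    "summarization": (1024, 128),
    "math_instruction": (1024, 1024),
}

# Student training steps per task, per the excerpt.
train_steps = {
    "summarization": 375,
    "gsm": 375,
    "math": 375,
    "translation": 1225,
}

print(train_config["learning_rate"], max_lengths["translation"])
```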