Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
Authors: Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, Olivier Bachem
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of GKD for distilling auto-regressive T5 language models for task-specific distillation on summarization, translation, and reasoning tasks, and task-agnostic distillation for instruction tuning. In this section, we evaluate GKD for distilling language models, a typical class of auto-regressive sequence models, on abstractive summarization, machine translation, and arithmetic reasoning. |
| Researcher Affiliation | Collaboration | 1 Google DeepMind, 2 Mila, 3 University of Toronto |
| Pseudocode | Yes | Algorithm 1 Generalized Knowledge Distillation (GKD) |
| Open Source Code | No | The paper mentions using 'open-sourced T5 models' and provides a link to their pretrained checkpoints, but it does not state that the code for the methodology described in this paper is open-source or provide a link to it. |
| Open Datasets | Yes | We use the XSum dataset (Narayan et al., 2018)... we consider the task on translating English to German using WMT14 en-de (Bojar et al., 2014)... we evaluate GKD on GSM8K (Cobbe et al., 2021)... Our distillation process utilizes the comprehensive FLAN2021 instruction tuning dataset... |
| Dataset Splits | Yes | We use the XSum dataset (Narayan et al., 2018)... We evaluate performance using ROUGE-2 score (Lin, 2004) of predicted summaries on the validation split of XSum but observe similar trends in ROUGE-L and ROUGE-1. We report performance on the validation split using the BLEU score for WMT14 en-de (Bojar et al., 2014). |
| Hardware Specification | Yes | All methods including baselines start from the supervised fine-tuned student checkpoint, which requires training for a few hours on the smallest TPUv3 (8 cores). |
| Software Dependencies | No | The paper mentions using 'the Adafactor optimizer (Shazeer & Stern, 2018)' but does not specify version numbers for other key software components like programming languages or libraries. |
| Experiment Setup | Yes | Table A.1: Hyperparameter Details for experiments on XSum., Table A.2: Hyperparameter Details for experiments on GSM8K., Table A.3: Hyperparameter Details for WMT en-de experiments., Table A.4: Hyperparameter Details for FLAN Instruction Tuning. |