On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Authors: Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, Olivier Bachem

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We demonstrate the efficacy of GKD for distilling auto-regressive T5 language models for task-specific distillation on summarization, translation, and reasoning tasks, and task-agnostic distillation for instruction tuning. In this section, we evaluate GKD for distilling language models, a typical class of auto-regressive sequence models, on abstractive summarization, machine translation, and arithmetic reasoning.'
Researcher Affiliation | Collaboration | Google DeepMind, Mila, University of Toronto
Pseudocode | Yes | Algorithm 1: Generalized Knowledge Distillation (GKD); a hedged sketch of the training step follows this table.
Open Source Code | No | The paper mentions using 'open-sourced T5 models' and links to their pretrained checkpoints, but it does not state that the code for the method described in the paper is open source, nor does it provide a link to such code.
Open Datasets | Yes | 'We use the XSum dataset (Narayan et al., 2018)...'; 'we consider the task of translating English to German using WMT14 en-de (Bojar et al., 2014)'; 'we evaluate GKD on GSM8K (Cobbe et al., 2021)...'; 'Our distillation process utilizes the comprehensive FLAN2021 instruction tuning dataset...'
Dataset Splits | Yes | 'We use the XSum dataset (Narayan et al., 2018)... We evaluate performance using ROUGE-2 score (Lin, 2004) of predicted summaries on the validation split of XSum but observe similar trends in ROUGE-L and ROUGE-1.' 'We report performance on the validation split using the BLEU score' for WMT14 en-de (Bojar et al., 2014).
Hardware Specification | Yes | 'All methods including baselines start from the supervised fine-tuned student checkpoint, which requires training for a few hours on the smallest TPUv3 (8 cores).'
Software Dependencies | No | The paper mentions using 'the Adafactor optimizer (Shazeer & Stern, 2018)' but does not give version numbers for key software components such as programming languages or libraries.
Experiment Setup | Yes | 'Table A.1: Hyperparameter Details for experiments on XSum'; 'Table A.2: Hyperparameter Details for experiments on GSM8K'; 'Table A.3: Hyperparameter Details for WMT en-de experiments'; 'Table A.4: Hyperparameter Details for FLAN Instruction Tuning'
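The paper's Algorithm 1 combines two choices: which sequences to distill on (fixed dataset outputs versus outputs sampled from the student itself, mixed with an on-policy fraction lambda) and which divergence to minimize between teacher and student token-level distributions (forward KL, reverse KL, or a generalized Jensen-Shannon divergence JSD(beta)). The following is a minimal PyTorch-style sketch of that training step, written from the paper's description rather than from the authors' implementation (which targets T5 models on TPUs); the names `student`, `teacher`, `student.generate`, `lambda_on_policy`, and `beta` are illustrative assumptions.

```python
# Minimal sketch of one GKD update, assuming models that return per-token vocab
# logits and a student with a sampling API. Padding/masking is omitted for brevity.
import torch
import torch.nn.functional as F

def generalized_jsd(teacher_logits, student_logits, beta=0.5, eps=1e-8):
    """Token-level generalized JSD(beta) between teacher (P) and student (Q):
        JSD_beta(P || Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M),
        with M = beta * P + (1 - beta) * Q.
    Under this parameterization, small beta behaves like a scaled forward KL and
    beta near 1 like a scaled reverse KL.
    """
    p_t = F.softmax(teacher_logits, dim=-1)
    p_s = F.softmax(student_logits, dim=-1)
    m = beta * p_t + (1.0 - beta) * p_s
    kl_tm = (p_t * (torch.log(p_t + eps) - torch.log(m + eps))).sum(-1)
    kl_sm = (p_s * (torch.log(p_s + eps) - torch.log(m + eps))).sum(-1)
    return (beta * kl_tm + (1.0 - beta) * kl_sm).mean()

def gkd_step(student, teacher, batch, optimizer, lambda_on_policy=0.5, beta=0.5):
    """One GKD update: mix on-policy (student-sampled) and fixed-output sequences."""
    x, y_fixed = batch["inputs"], batch["targets"]
    # With probability lambda_on_policy, distill on sequences the student samples itself.
    if torch.rand(()) < lambda_on_policy:
        with torch.no_grad():
            y = student.generate(x)          # assumed sampling API
    else:
        y = y_fixed                          # ground-truth (or teacher-generated) outputs
    student_logits = student(x, y)           # assumed: per-token logits over the vocabulary
    with torch.no_grad():
        teacher_logits = teacher(x, y)
    loss = generalized_jsd(teacher_logits, student_logits, beta=beta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Roughly, the baselines discussed in the paper correspond to particular settings of this sketch: distilling only on fixed targets with forward KL recovers standard supervised KD (lambda_on_policy = 0), while lambda_on_policy = 1 trains purely on student-sampled sequences, i.e. on-policy distillation.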