On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
Authors: Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, Olivier Bachem
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of GKD for distilling auto-regressive T5 language models for task-specific distillation on summarization, translation, and reasoning tasks, and task-agnostic distillation for instruction tuning. In this section, we evaluate GKD for distilling language models, a typical class of auto-regressive sequence models, on abstractive summarization, machine translation, and arithmetic reasoning. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind, ²Mila, ³University of Toronto |
| Pseudocode | Yes | Algorithm 1: Generalized Knowledge Distillation (GKD); a hedged loss sketch follows the table. |
| Open Source Code | No | The paper mentions using 'open-sourced T5 models' and links to their pretrained checkpoints, but it does not state that code for the method described in the paper is open source, nor does it provide a link to such code. |
| Open Datasets | Yes | We use the XSum dataset (Narayan et al., 2018)..., we consider the task on translating English to German using WMT14 en-de (Bojar et al., 2014)., we evaluate GKD on GSM8K (Cobbe et al., 2021)..., Our distillation process utilizes the comprehensive FLAN2021 instruction tuning dataset... |
| Dataset Splits | Yes | We use the XSum dataset (Narayan et al., 2018)... We evaluate performance using ROUGE-2 score (Lin, 2004) of predicted summaries on the validation split of XSum but observe similar trends in ROUGE-L and ROUGE-1. We report performance on the validation split using the BLEU score for WMT14 en-de (Bojar et al., 2014). |
| Hardware Specification | Yes | All methods including baselines start from the supervised fine-tuned student checkpoint, which requires training for a few hours on the smallest TPUv3 (8 cores). |
| Software Dependencies | No | The paper mentions using 'the Adafactor optimizer (Shazeer & Stern, 2018)' but does not specify version numbers for key software components such as programming languages or libraries. |
| Experiment Setup | Yes | Table A.1: Hyperparameter Details for experiments on XSum., Table A.2: Hyperparameter Details for experiments on GSM8K., Table A.3: Hyperparameter Details for WMT en-de experiments., Table A.4: Hyperparameter Details for FLAN Instruction Tuning. |
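
For context on the Pseudocode row: Algorithm 1 (GKD) interpolates between supervised and on-policy distillation. With some probability the student is trained on sequences sampled from itself, otherwise on fixed data, and the per-token objective is a generalized Jensen-Shannon divergence JSD(β) between the teacher and student token distributions. The snippet below is a minimal PyTorch sketch of that divergence only, not the authors' released code; the function name `generalized_jsd`, the use of PyTorch, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def generalized_jsd(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    beta: float = 0.5) -> torch.Tensor:
    """Illustrative sketch of the generalized JSD used as the GKD token loss.

    JSD(beta)(P, Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M),
    with mixture M = beta * P + (1 - beta) * Q, where P is the teacher
    distribution and Q the student distribution over the vocabulary.
    """
    p = F.softmax(teacher_logits, dim=-1)   # teacher token distribution P
    q = F.softmax(student_logits, dim=-1)   # student token distribution Q
    m = beta * p + (1.0 - beta) * q         # mixture distribution M
    log_m = m.clamp_min(1e-12).log()
    kl_p_m = (p * (p.clamp_min(1e-12).log() - log_m)).sum(dim=-1)
    kl_q_m = (q * (q.clamp_min(1e-12).log() - log_m)).sum(dim=-1)
    return (beta * kl_p_m + (1.0 - beta) * kl_q_m).mean()
```

In Algorithm 1 this loss is applied token-wise along each output sequence, whether the sequence comes from the fixed dataset or from student samples; the choice of β trades off mode-covering (forward-KL-like) and mode-seeking (reverse-KL-like) behavior.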