reproducibilityindex.ai

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes

Authors: Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, Olivier Bachem

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate the efficacy of GKD for distilling auto-regressive T5 language models for task-specific distillation on summarization, translation, and reasoning tasks, and task-agnostic distillation for instruction tuning. In this section, we evaluate GKD for distilling language models, a typical class of auto-regressive sequence models, on abstractive summarization, machine translation, and arithmetic reasoning.
Researcher Affiliation	Collaboration	1 Google DeepMind, 2 Mila, 3 University of Toronto
Pseudocode	Yes	Algorithm 1 Generalized Knowledge Distillation (GKD)
Open Source Code	No	The paper mentions using 'open-sourced T5 models' and provides a link to their pretrained checkpoints, but it does not state that the code for the methodology described in this paper is open-source or provide a link to it.
Open Datasets	Yes	We use the XSum dataset (Narayan et al., 2018)..., we consider the task on translating English to German using WMT14 en-de (Bojar et al., 2014)., we evaluate GKD on GSM8K (Cobbe et al., 2021)..., Our distillation process utilizes the comprehensive FLAN2021 instruction tuning dataset...
Dataset Splits	Yes	We use the XSum dataset (Narayan et al., 2018)... We evaluate performance using ROUGE-2 score (Lin, 2004) of predicted summaries on the validation split of XSum but observe similar trends in ROUGE-L and ROUGE-1. We report performance on the validation split using the BLEU score for WMT14 en-de (Bojar et al., 2014).
Hardware Specification	Yes	All methods including baselines start from the supervised fine-tuned student checkpoint, which requires training for a few hours on the smallest TPUv3 (8 cores).
Software Dependencies	No	The paper mentions using 'the Adafactor optimizer (Shazeer & Stern, 2018)' but does not specify version numbers for other key software components like programming languages or libraries.
Experiment Setup	Yes	Table A.1: Hyperparameter Details for experiments on XSum; Table A.2: Hyperparameter Details for experiments on GSM8K; Table A.3: Hyperparameter Details for WMT en-de experiments; Table A.4: Hyperparameter Details for FLAN Instruction Tuning.