Understanding Knowledge Distillation in Non-autoregressive Machine Translation
Authors: Chunting Zhou, Jiatao Gu, Graham Neubig
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial in NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the complexity of the distilled data that provides the best translation quality. Based on these findings, we further propose several approaches that can alter the complexity of data sets to improve the performance of NAT models. We achieve state-of-the-art performance for NAT-based models, and close the gap with the autoregressive baseline on the WMT14 En-De benchmark. |
| Researcher Affiliation | Collaboration | Chunting Zhou (Language Technologies Institute, Carnegie Mellon University), Jiatao Gu (Facebook AI Research), Graham Neubig (Language Technologies Institute, Carnegie Mellon University) |
| Pseudocode | No | The paper describes its methods and processes in narrative text and equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is released at https://github.com/pytorch/fairseq/tree/master/examples/nonautoregressive_translation. |
| Open Datasets | Yes | Data. We use the data set commonly used by prior work as our evaluation benchmark: WMT14 English-German (En-De), available at http://www.statmt.org/wmt14/translation-task.html. |
| Dataset Splits | Yes | Data. We use the data set commonly used by prior work as our evaluation benchmark: WMT14 English-German (En-De). We use newstest2013 as the validation set for selecting the best model, and newstest2014 as the test set. |
| Hardware Specification | No | The paper states 'All the models are run on 8 GPUs for 300,000 updates...' and 'all the other models are trained on 8 GPUs...'. However, it does not specify the model or type of GPU, or any other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'fairseq (Ott et al., 2019)' and 'Adam optimizer (Kingma & Ba, 2014)'. While software names are provided, no specific version numbers are given for these or other key software components to ensure reproducibility. |
| Experiment Setup | Yes | For all experiments, we adopt the Adam optimizer (Kingma & Ba, 2014) using β1 = 0.9, β2 = 0.98, ϵ = 1e-8. The learning rate is scheduled using inverse sqrt with a maximum learning rate 0.0005 and 4000 warmup steps. We set the label smoothing as 0.1. All the models are run on 8 GPUs for 300,000 updates with an effective batch size of 32,000 tokens. (A hedged sketch of this optimization setup appears below the table.) |
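
For reference when re-implementing the reported setup, the following is a minimal sketch of the optimizer and learning-rate schedule in plain PyTorch. The placeholder model, dummy loss, and truncated loop are illustrative assumptions; the paper itself trains NAT models with fairseq, and only the hyperparameters quoted above (Adam betas/epsilon, peak learning rate, warmup steps) are taken from the paper.

```python
# Sketch of the reported optimization setup: Adam with the quoted betas/epsilon
# and a linear-warmup, inverse-sqrt decay learning-rate schedule.
import torch

peak_lr = 5e-4        # maximum learning rate reported in the paper
warmup_steps = 4000   # warmup updates reported in the paper

model = torch.nn.Linear(512, 512)  # placeholder for the actual NAT model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.98),
    eps=1e-8,
)

def inverse_sqrt(step: int) -> float:
    """Scale factor: linear warmup to 1.0, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps ** 0.5) * (step ** -0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

# Skeleton loop; the paper runs 300,000 updates with an effective batch of
# ~32,000 tokens accumulated across 8 GPUs, which is not reproduced here.
for update in range(1, 10):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The label smoothing of 0.1 mentioned in the setup would be applied in the loss (e.g., a label-smoothed cross-entropy criterion) rather than in the optimizer, and is omitted from this sketch.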