Understanding Knowledge Distillation in Non-autoregressive Machine Translation
Authors: Chunting Zhou, Jiatao Gu, Graham Neubig
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we first design systematic experiments to investigate why knowledge distillation is crucial in NAT training. We find that knowledge distillation can reduce the complexity of data sets and help NAT to model the variations in the output data. Furthermore, a strong correlation is observed between the capacity of an NAT model and the complexity of the distilled data that provides the best translation quality. Based on these findings, we further propose several approaches that can alter the complexity of data sets to improve the performance of NAT models. We achieve state-of-the-art performance for NAT-based models, and close the gap with the autoregressive baseline on the WMT14 En-De benchmark. |
| Researcher Affiliation | Collaboration | Chunting Zhou (Language Technologies Institute, Carnegie Mellon University), Jiatao Gu (Facebook AI Research), Graham Neubig (Language Technologies Institute, Carnegie Mellon University) |
| Pseudocode | No | The paper describes its methods and processes in narrative text and equations but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is released at https://github.com/pytorch/fairseq/tree/master/examples/nonautoregressive_translation. |
| Open Datasets | Yes | Data. We use the data set commonly used by prior work as our evaluation benchmark: WMT14 English-German (En-De), available at http://www.statmt.org/wmt14/translation-task.html. |
| Dataset Splits | Yes | Data. We use the data set commonly used by prior work as our evaluation benchmark: WMT14 English-German (En-De). We use newstest2013 as the validation set for selecting the best model, and newstest2014 as the test set. |
| Hardware Specification | No | The paper states 'All the models are run on 8 GPUs for 300,000 updates...' and 'all the other models are trained on 8 GPUs...'. However, it does not specify the model or type of GPU, or any other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'fairseq (Ott et al., 2019)' and 'Adam optimizer (Kingma & Ba, 2014)'. While software names are provided, no specific version numbers are given for these or other key software components to ensure reproducibility. |
| Experiment Setup | Yes | For all experiments, we adopt the Adam optimizer (Kingma & Ba, 2014) using β1 = 0.9, β2 = 0.98, ϵ = 1e-8. The learning rate is scheduled using inverse sqrt with a maximum learning rate 0.0005 and 4000 warmup steps. We set the label smoothing as 0.1. All the models are run on 8 GPUs for 300,000 updates with an effective batch size of 32,000 tokens. (A hedged sketch of this optimization setup appears below the table.) |
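
For reference when re-implementing the reported setup, the following is a minimal sketch of the optimizer and learning-rate schedule in plain PyTorch. The placeholder model, dummy loss, and truncated loop are illustrative assumptions; the paper itself trains NAT models with fairseq, and only the hyperparameters quoted above (Adam betas/epsilon, peak learning rate, warmup steps) are taken from the paper.

```python
# Sketch of the reported optimization setup: Adam with the quoted betas/epsilon
# and a linear-warmup, inverse-sqrt decay learning-rate schedule.
import torch

peak_lr = 5e-4        # maximum learning rate reported in the paper
warmup_steps = 4000   # warmup updates reported in the paper

model = torch.nn.Linear(512, 512)  # placeholder for the actual NAT model

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=peak_lr,
    betas=(0.9, 0.98),
    eps=1e-8,
)

def inverse_sqrt(step: int) -> float:
    """Scale factor: linear warmup to 1.0, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return step / warmup_steps
    return (warmup_steps ** 0.5) * (step ** -0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

# Skeleton loop; the paper runs 300,000 updates with an effective batch of
# ~32,000 tokens accumulated across 8 GPUs, which is not reproduced here.
for update in range(1, 10):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 512)).pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The label smoothing of 0.1 mentioned in the setup would be applied in the loss (e.g., a label-smoothed cross-entropy criterion) rather than in the optimizer, and is omitted from this sketch.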