Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation

Authors: Min Liu, Yu Bao, Chengqi Zhao, Shujian Huang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results on multiple WMT language directions and several representative models show that our approach can realize a flexible trade-off between the quality and complexity of training data for NAT models, achieving strong performances. We conduct experiments on two widely-used machine translation datasets: WMT14 English-German (En-De) and WMT16 English-Romanian (En-Ro)...
Researcher Affiliation | Collaboration | Min Liu (1), Yu Bao (2), Chengqi Zhao (2), Shujian Huang (1,3); 1: National Key Laboratory for Novel Software Technology, Nanjing University; 2: ByteDance AI Lab; 3: Collaborative Innovation Center of Novel Software Technology and Industrialization
Pseudocode | Yes | Algorithm 1: Data Selection for the k-th Update
Open Source Code | No | The paper does not provide any specific links or explicit statements about the availability of open-source code for the described methodology.
Open Datasets | Yes | We conduct experiments on two widely-used machine translation datasets: WMT14 English-German (En-De) and WMT16 English-Romanian (En-Ro), which consist of 3.96M and 0.6M sentence pairs, respectively. Following the common practices, we process the datasets with Moses script (Koehn et al. 2007) and segment the words into subword units using byte-pair encoding (BPE, Sennrich, Haddow, and Birch 2016). (A preprocessing sketch is given below the table.)
Dataset Splits | No | The paper mentions using 'validation BLEU scores' but does not provide specific details on the training, validation, and test dataset splits, such as exact percentages, sample counts, or explicit references to predefined splits with full bibliographic information.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models or processor types) used for running its experiments.
Software Dependencies | No | The paper mentions tools like 'Moses script' and 'byte-pair encoding' and optimizers like 'Adam' but does not provide specific software dependencies or library versions needed for replication.
Experiment Setup | Yes | We follow the hyperparameters of models in their original papers. We set the dropout rate to 0.1 for WMT14 En-De/De-En and 0.3 for WMT16 En-Ro. For the optimizer, we use Adam with β = (0.9, 0.999) to train our model. The learning rate warms up to 5e-4 within 4k steps and then decays with the inverse square-root schedule. For the sampling ratio λ in GLAT+CTC, we adopt linear annealing from 0.5 to 0.3. As to the hard-to-easy learning strategy, we set T0 = 0.4, T1 = 1.0 under En-De/De-En and T0 = 0.6, T1 = 1.0 under En-Ro for GLAT+CTC. We set T0 = 0, T1 = 1.0 for other models. All the NAT evaluators and students are trained with batches of 64k tokens, lasting 300k updates and 100k updates for En-De/De-En and En-Ro respectively. (See the training-schedule sketch below the table.)
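
The Open Datasets row quotes Moses tokenization followed by BPE segmentation, but (as the Software Dependencies row notes) no tool versions are pinned. The sketch below shows one plausible way to reproduce the BPE step with the subword-nmt package; the choice of subword-nmt, the 32k merge count, and the file names are assumptions for illustration, not details confirmed by the paper.

```python
# Minimal BPE preprocessing sketch (assumed tooling: subword-nmt).
# The paper only states that Moses scripts and BPE are used; the merge
# count (32k) and file paths below are illustrative assumptions.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merges on the Moses-tokenized training data (both languages concatenated).
with open("train.tok.en-de", encoding="utf-8") as infile, \
        open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# Apply the learned merges to one side of the corpus.
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)

with open("train.tok.en", encoding="utf-8") as src, \
        open("train.bpe.en", "w", encoding="utf-8") as out:
    for line in src:
        out.write(bpe.process_line(line))
```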
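
The Experiment Setup row specifies warmup of the learning rate to 5e-4 over 4k steps with inverse square-root decay afterwards, and linear annealing of the GLAT+CTC sampling ratio λ from 0.5 to 0.3. A minimal sketch of those two schedules is shown below; the function names and the assumption that λ is annealed over the full 300k-update En-De run are mine, not stated in the paper.

```python
# Sketch of the schedules quoted in the Experiment Setup row.
# peak_lr=5e-4 and warmup_updates=4000 come from the paper; annealing lambda
# over the full 300k-update run is an assumption.

def inverse_sqrt_lr(step: int, peak_lr: float = 5e-4, warmup_updates: int = 4000) -> float:
    """Linear warmup to peak_lr, then inverse square-root decay."""
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    return peak_lr * (warmup_updates / step) ** 0.5

def glat_sampling_ratio(step: int, total_updates: int = 300_000,
                        start: float = 0.5, end: float = 0.3) -> float:
    """Linear annealing of the GLAT+CTC sampling ratio lambda from start to end."""
    progress = min(step / total_updates, 1.0)
    return start + (end - start) * progress

if __name__ == "__main__":
    # E.g. at step 16k the learning rate has decayed to half its 2.5e-4 value at 16x warmup? No:
    # 5e-4 * sqrt(4000/16000) = 2.5e-4, while lambda has moved from 0.5 toward 0.3.
    for step in (1_000, 4_000, 16_000, 300_000):
        print(step, round(inverse_sqrt_lr(step), 6), round(glat_sampling_ratio(step), 3))
```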