Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation
Authors: Min Liu, Yu Bao, Chengqi Zhao, Shujian Huang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results on multiple WMT language directions and several representative models show that our approach can realize a flexible trade-off between the quality and complexity of training data for NAT models, achieving strong performances. We conduct experiments on two widely-used machine translation datasets: WMT14 English-German (En-De) and WMT16 English-Romanian (En-Ro)... |
| Researcher Affiliation | Collaboration | Min Liu (1), Yu Bao (2), Chengqi Zhao (2), Shujian Huang (1,3); (1) National Key Laboratory for Novel Software Technology, Nanjing University; (2) ByteDance AI Lab; (3) Collaborative Innovation Center of Novel Software Technology and Industrialization |
| Pseudocode | Yes | Algorithm 1: Data Selection for the k-th Update (an illustrative sketch of such a selection step appears after the table) |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about the availability of open-source code for the described methodology. |
| Open Datasets | Yes | We conduct experiments on two widely-used machine translation datasets: WMT14 English-German (En-De) and WMT16 English-Romanian (En-Ro), which consist of 3.96M and 0.6M sentence pairs, respectively. Following the common practices, we process the datasets with Moses script (Koehn et al. 2007) and segment the words into subword units using byte-pair encoding (BPE, Sennrich, Haddow, and Birch 2016). (A hedged preprocessing sketch appears after the table.) |
| Dataset Splits | No | The paper mentions using 'validation BLEU scores' but does not provide specific details on the training, validation, and test dataset splits, such as exact percentages, sample counts, or explicit references to predefined splits with full bibliographic information. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models or processor types) used for running its experiments. |
| Software Dependencies | No | The paper mentions tools like 'Moses script' and 'byte-pair encoding' and optimizers like 'Adam' but does not provide specific software dependencies or library versions needed for replication. |
| Experiment Setup | Yes | We follow the hyperparameters of models in their original papers. We set the dropout rate to 0.1 for WMT14 En-De/De-En and 0.3 for WMT16 En-Ro. For the optimizer, we use Adam with β = (0.9, 0.999) to train our model. The learning rate warms up to 5e-4 within 4k steps and then decays with the inverse square-root schedule. For the sampling ratio λ in GLAT+CTC, we adopt linear annealing from 0.5 to 0.3. As to the hard-to-easy learning strategy, we set T0 = 0.4, T1 = 1.0 under En-De/De-En and T0 = 0.6, T1 = 1.0 under En-Ro for GLAT+CTC. We set T0 = 0, T1 = 1.0 for other models. All the NAT evaluators and students are trained with batches of 64k tokens, lasting 300k updates and 100k updates for En-De/De-En and En-Ro respectively. |
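
The report only surfaces the title of the paper's pseudocode ("Algorithm 1: Data Selection for the k-th Update"), so the exact selection rule cannot be recovered here. As a purely illustrative aid, the sketch below shows one way a hard-to-easy selection step could be wired up so that it matches the reported schedule from T0 to T1 (e.g. T0 = 0.4, T1 = 1.0 for En-De): an evaluator scores each training pair and the kept fraction grows as training proceeds. All names (`selection_ratio`, `select_for_update`, `score_fn`), the linear schedule over the update budget, and the ranking rule are assumptions, not the authors' algorithm.

```python
# Illustrative only: one possible shape for a hard-to-easy data-selection step.
# The kept fraction anneals from T0 to T1 (values reported in the paper, e.g. T0=0.4, T1=1.0 for En-De);
# the scoring interface, sorting direction, and schedule span are assumptions.
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def selection_ratio(k: int, total_updates: int, t0: float = 0.4, t1: float = 1.0) -> float:
    """Linearly anneal the kept fraction from t0 to t1 over training (assumed schedule)."""
    frac = min(k / total_updates, 1.0)
    return t0 + frac * (t1 - t0)

def select_for_update(
    pairs: List[Pair],
    score_fn: Callable[[Pair], float],  # e.g. an evaluator's per-pair score (assumed interface)
    k: int,
    total_updates: int,
) -> List[Pair]:
    """Keep the hardest fraction of pairs first, relaxing toward the full set (hard-to-easy)."""
    ratio = selection_ratio(k, total_updates)
    ranked = sorted(pairs, key=score_fn)  # lower score = harder, by assumption
    n_keep = max(1, int(ratio * len(ranked)))
    return ranked[:n_keep]
```

The setup excerpt mentions "NAT evaluators", which suggests the scoring model is itself a NAT model, but what the score measures and how raw versus distilled references are chosen per pair is not recoverable from this report.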
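The datasets cell states that the data were processed with the Moses scripts and segmented with BPE, but gives neither tool versions nor the number of merge operations. The following is a minimal sketch of an equivalent pipeline in Python, assuming the sacremoses reimplementation of the Moses tokenizer and the subword-nmt package as stand-ins; the 32,000 merge operations and all file paths are hypothetical placeholders, not values from the paper.

```python
# Hypothetical preprocessing pipeline: Moses-style tokenization + BPE segmentation.
# The paper names the Moses scripts and BPE but not versions or merge counts;
# sacremoses and subword-nmt are assumed stand-ins, and 32000 merges is a placeholder.
from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

def tokenize_file(src_path: str, out_path: str, lang: str) -> None:
    """Tokenize one side of the corpus with the (assumed) sacremoses tokenizer."""
    mt = MosesTokenizer(lang=lang)
    with open(src_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(mt.tokenize(line.strip(), return_str=True) + "\n")

def learn_and_apply_bpe(tokenized_path: str, codes_path: str, out_path: str, merges: int = 32000) -> None:
    """Learn BPE codes on the tokenized corpus and segment it into subword units."""
    with open(tokenized_path, encoding="utf-8") as fin, open(codes_path, "w", encoding="utf-8") as fcodes:
        learn_bpe(fin, fcodes, merges)
    with open(codes_path, encoding="utf-8") as fcodes:
        bpe = BPE(fcodes)
    with open(tokenized_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(bpe.process_line(line.strip()) + "\n")
```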
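The hyperparameters in the setup row are concrete enough to restate as a configuration. The sketch below collects them in one place and reproduces the two schedules the excerpt describes (linear warmup to 5e-4 over 4k steps followed by inverse square-root decay, and linear annealing of the GLAT+CTC sampling ratio λ from 0.5 to 0.3). The class and function names are hypothetical, the training framework is unspecified in the paper, and the choice to anneal λ over the full update budget is an assumption.

```python
# Hypothetical summary of the training setup reported in the paper.
# Only the numeric values come from the excerpt; names and structure are assumptions.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    dropout: float = 0.1           # 0.1 for WMT14 En-De/De-En, 0.3 for WMT16 En-Ro
    adam_betas: tuple = (0.9, 0.999)
    peak_lr: float = 5e-4          # reached at the end of warmup
    warmup_steps: int = 4_000
    max_tokens_per_batch: int = 64_000
    max_updates: int = 300_000     # 300k for En-De/De-En, 100k for En-Ro

def inverse_sqrt_lr(step: int, cfg: TrainConfig) -> float:
    """Linear warmup to peak_lr, then inverse square-root decay, as described in the excerpt."""
    if step < cfg.warmup_steps:
        return cfg.peak_lr * step / cfg.warmup_steps
    return cfg.peak_lr * (cfg.warmup_steps / step) ** 0.5

def glat_sampling_ratio(step: int, cfg: TrainConfig, start: float = 0.5, end: float = 0.3) -> float:
    """Linear annealing of the GLAT+CTC sampling ratio lambda from 0.5 to 0.3 (span assumed)."""
    frac = min(step / cfg.max_updates, 1.0)
    return start + frac * (end - start)

if __name__ == "__main__":
    cfg = TrainConfig()
    for s in (1_000, 4_000, 16_000, 300_000):
        print(s, round(inverse_sqrt_lr(s, cfg), 6), round(glat_sampling_ratio(s, cfg), 3))
```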