RenewNAT: Renewing Potential Translation for Non-autoregressive Transformer
Authors: Pei Guo, Yisheng Xiao, Juntao Li, Min Zhang
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on various translation benchmarks (e.g., 4 WMT) show that our framework consistently improves the performance of strong fully NAT methods (e.g., GLAT and DSLP) without additional speed overhead. |
| Researcher Affiliation | Academia | Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University {pguolst,ysxiaoo}@stu.suda.edu.cn, {ljt,minzhang}@suda.edu.cn |
| Pseudocode | Yes | Algorithm 1: RenewNAT Training |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We evaluate our RenewNAT on five widely used machine translation benchmarks, including WMT14 EN↔DE (4.5M pairs), WMT16 EN↔RO (610K pairs), and IWSLT14 DE→EN (153K pairs). We follow the approach of Vaswani et al. (2017) to process WMT14 EN↔DE and adopt the data-processing approach of Lee, Mansimov, and Cho (2018) for WMT16 EN↔RO. For the IWSLT14 DE→EN dataset, we follow the steps in Guo et al. (2019). |
| Dataset Splits | No | The paper mentions using 'validation BLEU scores' to choose checkpoints and evaluates on a 'test set', but does not explicitly provide specific percentages or sample counts for the training, validation, and test splits within the paper. It refers to external methods for data processing without detailing the splits. |
| Hardware Specification | Yes | Specifically, we perform generation on the test set and set the batch size to 1, then we compare the latency of RenewNAT with the Vanilla Transformer on a single Nvidia A5000 card. (A latency-measurement sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'Fairseq (Ott et al. 2019)' but does not provide specific version numbers for Fairseq or any other software dependencies. |
| Experiment Setup | Yes | During training, we follow most of the hyperparameter settings in Gu and Kong (2021) and Qian et al. (2021). For the WMT datasets, we use the base Transformer configuration (6 layers per stack, 8 attention heads per layer, 512 model dimensions, 2048 hidden dimensions) and adapt the warm-up learning rate schedule (Vaswani et al. 2017), warming up to 5e-4 in 4k steps. As IWSLT is a smaller dataset, we use a smaller Transformer model (6 layers per stack, 8 attention heads per layer, 512 model dimensions, 1024 hidden dimensions). We train the models with batches of 64k/8k tokens on the WMT/IWSLT datasets and use the Adam optimizer (Kingma and Ba 2014) with β = (0.9, 0.999) and (0.9, 0.98) for GLAT and Vanilla NAT, respectively. We train all models for 300k steps and average the 5 best checkpoints chosen by validation BLEU scores as our final model for inference. It is noted that we set K to 2, which achieves the best performance in the Ablation Study. For AT models, we use a beam size of 5 for inference. For RenewNAT, we apply noisy parallel decoding (NPD; Gu et al. 2018) and set the length beam to 5. (A learning-rate-schedule sketch follows the table.) |
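The Hardware Specification row describes the latency protocol: decode the test set one sentence at a time (batch size 1) on a single GPU and compare against the Vanilla Transformer. The paper does not release a measurement script, so the sketch below is only one plausible way to implement such a timing loop; `model.generate` and `test_set` are hypothetical stand-ins for the actual decoding interface and data loader.

```python
# Minimal latency-probe sketch (not the authors' script): time each decode with
# batch size 1, synchronizing the GPU so the measurement covers the full pass.
import time
import torch

def measure_latency_ms(model, test_set, device="cuda"):
    model.eval().to(device)
    per_sentence_ms = []
    with torch.no_grad():
        for src in test_set:                 # one source sentence per step (batch size 1)
            src = src.unsqueeze(0).to(device)
            torch.cuda.synchronize(device)   # make sure prior GPU work has finished
            start = time.perf_counter()
            _ = model.generate(src)          # hypothetical single-pass NAT decode
            torch.cuda.synchronize(device)   # wait for the decode to complete
            per_sentence_ms.append((time.perf_counter() - start) * 1000.0)
    return sum(per_sentence_ms) / len(per_sentence_ms)
```

Speed-up figures would then be reported relative to an autoregressive Transformer baseline measured with the same loop on the same GPU (an Nvidia A5000 in the paper).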
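The Experiment Setup row condenses the training configuration. As a reading aid, here is a minimal sketch, assuming a standard PyTorch setup, of the warm-up schedule and Adam settings it quotes: linear warm-up to a peak learning rate of 5e-4 over 4k steps followed by inverse-square-root decay (Vaswani et al. 2017). The `model` placeholder and the `LambdaLR` wiring are illustrative, not the authors' code.

```python
# Minimal sketch of the quoted schedule and optimizer settings, not the
# RenewNAT implementation.
import torch

PEAK_LR = 5e-4        # "warming up to 5e-4"
WARMUP_STEPS = 4_000  # "in 4k steps"

def inverse_sqrt_lr(step: int) -> float:
    """Linear warm-up, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * (WARMUP_STEPS / step) ** 0.5

model = torch.nn.Linear(512, 512)  # placeholder for the 512-dim base Transformer
# β = (0.9, 0.999) for GLAT; the quoted setup uses (0.9, 0.98) for Vanilla NAT.
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: inverse_sqrt_lr(step) / PEAK_LR
)
# During training, scheduler.step() would be called once per optimizer update
# for the full 300k steps.

# The schedule peaks at exactly 5e-4 at the end of warm-up, then decays.
for step in (1, 2_000, 4_000, 40_000, 300_000):
    print(f"step {step:>7d}: lr = {inverse_sqrt_lr(step):.2e}")
```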