Understanding and Improving Lexical Choice in Non-Autoregressive Translation
Authors: Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, Zhaopeng Tu
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing the lexical choice errors on low-frequency words. |
| Researcher Affiliation | Collaboration | Liang Ding (The University of Sydney), Longyue Wang (Tencent AI Lab), Xuebo Liu (University of Macau), Derek F. Wong (University of Macau), Dacheng Tao (The University of Sydney) & Zhaopeng Tu (Tencent AI Lab) |
| Pseudocode | No | No pseudocode or algorithm block found. |
| Open Source Code | Yes | Code is available at: https://github.com/alphadl/LCNAT |
| Open Datasets | Yes | Experiments were conducted on four widely-used translation datasets: WMT14 English-German (En-De; Vaswani et al., 2017), WMT16 Romanian-English (Ro-En; Gu et al., 2018), WMT17 Chinese-English (Zh-En; Hassan et al., 2018), and WAT17 Japanese-English (Ja-En; Morishita et al., 2017). |
| Dataset Splits | Yes | We use the same validation and test datasets as previous works for a fair comparison. To avoid unknown words, we preprocessed the data via BPE (Sennrich et al., 2016) with 32K merge operations (a minimal BPE sketch follows the table). GIZA++ (Och & Ney, 2003) was employed to build word alignments for the training datasets. For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, or cloud instances) are mentioned for the experimental setup. |
| Software Dependencies | No | The paper mentions BPE, GIZA++, and Adam, but does not provide specific version numbers for these or any other software libraries or dependencies used in the implementation. |
| Experiment Setup | Yes | For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction, and apply weight decay with 0.01 and label smoothing with ϵ = 0.1. We train batches of approximately 128K tokens using Adam (Kingma & Ba, 2015). The learning rate warms up to 5 × 10⁻⁴ in the first 10K steps, and then decays with the inverse square-root schedule. |
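As a reading aid for the Experiment Setup row, the sketch below writes out the stated learning-rate schedule in plain Python: warmup to the peak of 5 × 10⁻⁴ over the first 10K updates, then inverse square-root decay. The linear warmup shape and the helper name `lr_at` are our assumptions for illustration; the paper only specifies the peak value, the warmup length, and the decay schedule.

```python
import math

# Peak learning rate and warmup length as stated in the paper.
PEAK_LR = 5e-4
WARMUP_STEPS = 10_000


def lr_at(step: int) -> float:
    """Learning rate at a given update step (step >= 1).

    Linear warmup to PEAK_LR over WARMUP_STEPS (assumed shape),
    then inverse square-root decay as stated in the paper.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS           # warmup phase
    return PEAK_LR * math.sqrt(WARMUP_STEPS / step)    # inverse sqrt decay


if __name__ == "__main__":
    for s in (1_000, 10_000, 40_000, 90_000):
        print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

At step 10K the rate reaches 5e-4, and at 40K it has decayed to half of that, matching the inverse square-root behaviour described in the row above.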
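The Dataset Splits row states that the data were segmented with BPE (Sennrich et al., 2016) using 32K merge operations. The toy sketch below is adapted from the merge-learning loop in that paper and only illustrates what one "merge operation" does; the tiny vocabulary and the 10-merge budget are illustrative stand-ins for the full training corpora and the 32K merges used in the paper, where a subword toolkit rather than this loop would be run.

```python
import collections
import re


def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs in the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    """Apply one merge operation: join the chosen pair everywhere it occurs."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in v_in.items():
        v_out[pattern.sub(''.join(pair), word)] = freq
    return v_out


# Toy word-frequency vocabulary; words are space-separated characters
# with an end-of-word marker, as in Sennrich et al. (2016).
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

NUM_MERGES = 10  # the paper uses 32K merges on the real training data
for _ in range(NUM_MERGES):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)
```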