Understanding and Improving Lexical Choice in Non-Autoregressive Translation

Authors: Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, Zhaopeng Tu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing the lexical choice errors on low-frequency words.
Researcher Affiliation | Collaboration | Liang Ding^1, Longyue Wang^2, Xuebo Liu^3, Derek F. Wong^3, Dacheng Tao^1 & Zhaopeng Tu^2; ^1 The University of Sydney, ^2 Tencent AI Lab, ^3 University of Macau
Pseudocode | No | No pseudocode or algorithm block found.
Open Source Code | Yes | Code is available at: https://github.com/alphadl/LCNAT
Open Datasets | Yes | Experiments were conducted on four widely used translation datasets: WMT14 English-German (En-De, Vaswani et al., 2017), WMT16 Romanian-English (Ro-En, Gu et al., 2018), WMT17 Chinese-English (Zh-En, Hassan et al., 2018), and WAT17 Japanese-English (Ja-En, Morishita et al., 2017).
Dataset Splits | Yes | We use the same validation and test datasets as previous works for a fair comparison. To avoid unknown words, we preprocessed the data via BPE (Sennrich et al., 2016) with 32K merge operations. GIZA++ (Och & Ney, 2003) was employed to build word alignments for the training datasets. For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction.
Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, or cloud instances) are mentioned for the experimental setup.
Software Dependencies | No | The paper mentions BPE, GIZA++, and Adam, but does not provide specific version numbers for these or any other software libraries or dependencies used in the implementation.
Experiment Setup | Yes | For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction, and apply weight decay of 0.01 and label smoothing with ϵ = 0.1. We train batches of approximately 128K tokens using Adam (Kingma & Ba, 2015). The learning rate warms up to 5 × 10⁻⁴ in the first 10K steps, and then decays with the inverse square-root schedule.
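
The Dataset Splits row above quotes BPE preprocessing with 32K merge operations. As a rough illustration only, the sketch below uses the Python API of the subword-nmt package; the paper does not name the exact BPE implementation or version, and the file paths here are placeholders.

```python
# Sketch of the BPE preprocessing step (32K merge operations) quoted in the
# Dataset Splits row. Assumes the `subword-nmt` package; the paper does not
# state which implementation was used, and the file names are placeholders.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 32K merge operations on the tokenized training text.
with codecs.open("train.tok.txt", encoding="utf-8") as infile, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# Load the learned codes and segment a corpus file with them.
with codecs.open("bpe.codes", encoding="utf-8") as codes_file:
    bpe = BPE(codes_file)

with codecs.open("train.tok.txt", encoding="utf-8") as fin, \
     codecs.open("train.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```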
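
The Experiment Setup row describes a learning rate that warms up to 5 × 10⁻⁴ over the first 10K steps and then follows an inverse square-root decay. Below is a minimal sketch of that schedule, assuming the common fairseq-style formulation (linear warmup to the peak rate, then decay proportional to 1/√step); the authors' exact implementation may differ.

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=10_000):
    """Learning-rate schedule described in the Experiment Setup row:
    warm up linearly to peak_lr over warmup_steps, then decay with the
    inverse square root of the step number."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Inverse square-root decay; equals peak_lr exactly at step == warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)

# Example: the rate peaks at step 10K and then decays by half for every 4x steps.
for s in (1_000, 10_000, 40_000, 160_000):
    print(s, round(inverse_sqrt_lr(s), 6))
```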