Understanding and Improving Lexical Choice in Non-Autoregressive Translation

Authors: Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao, Zhaopeng Tu

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across language pairs and model architectures demonstrate the effectiveness and universality of the proposed approach. Extensive analyses confirm our claim that our approach improves performance by reducing the lexical choice errors on low-frequency words.
Researcher Affiliation | Collaboration | Liang Ding^1, Longyue Wang^2, Xuebo Liu^3, Derek F. Wong^3, Dacheng Tao^1 & Zhaopeng Tu^2; ^1 The University of Sydney, ^2 Tencent AI Lab, ^3 University of Macau
Pseudocode | No | No pseudocode or algorithm block found.
Open Source Code | Yes | Code is available at: https://github.com/alphadl/LCNAT
Open Datasets | Yes | Experiments were conducted on four widely used translation datasets: WMT14 English-German (En-De, Vaswani et al., 2017), WMT16 Romanian-English (Ro-En, Gu et al., 2018), WMT17 Chinese-English (Zh-En, Hassan et al., 2018), and WAT17 Japanese-English (Ja-En, Morishita et al., 2017).
Dataset Splits | Yes | We use the same validation and test datasets as previous works for a fair comparison. To avoid unknown words, we preprocessed the data via BPE (Sennrich et al., 2016) with 32K merge operations. GIZA++ (Och & Ney, 2003) was employed to build word alignments for the training datasets. For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction.
Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory, or cloud instances) are mentioned for the experimental setup.
Software Dependencies | No | The paper mentions BPE, GIZA++, and Adam, but does not provide specific version numbers for these or any other software libraries or dependencies used in the implementation.
Experiment Setup | Yes | For regularization, we tune the dropout rate from [0.1, 0.2, 0.3] based on validation performance in each direction, and apply weight decay of 0.01 and label smoothing with ϵ = 0.1. We train batches of approximately 128K tokens using Adam (Kingma & Ba, 2015). The learning rate warms up to 5 × 10⁻⁴ in the first 10K steps, and then decays with the inverse square-root schedule.
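
The Dataset Splits row above quotes BPE preprocessing with 32K merge operations. As a rough illustration only, the sketch below uses the Python API of the subword-nmt package; the paper does not name the exact BPE implementation or version, and the file paths here are placeholders.

```python
# Sketch of the BPE preprocessing step (32K merge operations) quoted in the
# Dataset Splits row. Assumes the `subword-nmt` package; the paper does not
# state which implementation was used, and the file names are placeholders.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 32K merge operations on the tokenized training text.
with codecs.open("train.tok.txt", encoding="utf-8") as infile, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000)

# Load the learned codes and segment a corpus file with them.
with codecs.open("bpe.codes", encoding="utf-8") as codes_file:
    bpe = BPE(codes_file)

with codecs.open("train.tok.txt", encoding="utf-8") as fin, \
     codecs.open("train.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(bpe.process_line(line))
```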
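
The Experiment Setup row describes a learning rate that warms up to 5 × 10⁻⁴ over the first 10K steps and then follows an inverse square-root decay. Below is a minimal sketch of that schedule, assuming the common fairseq-style formulation (linear warmup to the peak rate, then decay proportional to 1/√step); the authors' exact implementation may differ.

```python
import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=10_000):
    """Learning-rate schedule described in the Experiment Setup row:
    warm up linearly to peak_lr over warmup_steps, then decay with the
    inverse square root of the step number."""
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Inverse square-root decay; equals peak_lr exactly at step == warmup_steps.
    return peak_lr * math.sqrt(warmup_steps / step)

# Example: the rate peaks at step 10K and then decays by half for every 4x steps.
for s in (1_000, 10_000, 40_000, 160_000):
    print(s, round(inverse_sqrt_lr(s), 6))
```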