Explicit Alignment Learning for Neural Machine Translation

Authors: Zuchao Li, Hai Zhao, Fengshun Xiao, Masao Utiyama, Eiichiro Sumita

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conducted experiments on both small-scale (IWSLT14 De→En and IWSLT13 Fr→En) and large-scale (WMT14 En→De, En→Fr, and WMT17 Zh→En) benchmarks. Evaluation results show that our EAL methods significantly outperformed strong baseline methods, demonstrating the effectiveness of EAL.
Researcher Affiliation | Academia | 1) Department of Computer Science and Engineering, Shanghai Jiao Tong University; 2) MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; 3) National Institute of Information and Communications Technology (NICT), Kyoto, Japan
Pseudocode | Yes | Algorithm 1: Explicit Alignment Learning (an illustrative training-step sketch appears after this table)
Open Source Code | No | The paper does not provide any explicit statement or link indicating the release of open-source code for the described methodology.
Open Datasets | Yes | Our proposed method was evaluated on two typical translation tasks: rich-resource (WMT14 English-to-German (En→De), English-to-French (En→Fr), and WMT17 Chinese-to-English (Zh→En)) and low-resource (IWSLT14 German-to-English (De→En) and IWSLT13 French-to-English (Fr→En)).
Dataset Splits | Yes | We randomly selected 4K sentence pairs from the training set as the validation set and used newstest2014 as the test set. The WMT14 En→Fr training set contained 36M bilingual sentence pairs; newstest2012 and newstest2013 were combined for validation, and newstest2014 was used as the test set. The WMT17 Zh→En training set contained 22M bilingual sentence pairs, and newsdev2017 and newstest2017 were used as the validation and test sets, respectively. (These splits are summarized in the first sketch after the table.)
Hardware Specification | Yes | All our models were trained on eight NVIDIA V100 GPUs.
Software Dependencies | No | The paper mentions using the "Transformer structure" and the "BPE algorithm" but does not specify software dependencies with version numbers (e.g., deep learning frameworks, programming language versions).
Experiment Setup | Yes | Following [Vaswani et al., 2017], we used the same Transformer base/big settings for the rich-resource datasets. The Transformer.base model consists of a six-layer encoder and a six-layer decoder. The number of attention heads, the dimension of the word embeddings, the dimension of the hidden states, and the inner dimension of the position-wise feed-forward networks were 8, 512, 512, and 2048, respectively, and the dropout rate was 0.1. ... In our EAL approach, the decoder layer for extracting the alignment weight was set to L_A = 4 and the number of warmup steps was set to w = 20K. We set the sampling ratio in our experiments to 20%, ... γ2 is set to 0.1, which is fixed during training. ... The learning rate schedule was the same as in [Vaswani et al., 2017]: lr = d^(-0.5) * min(step^(-0.5), step * warmup_step^(-1.5)), where d is the dimension of the embeddings, step is the number of training steps, and warmup_step is the number of warmup steps. ... The value of label smoothing was set to 0.1. The learning rate was varied under a warmup strategy with 8,000 warmup steps. (The learning-rate schedule and an illustrative EAL training step are sketched after this table.)
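
The dataset splits quoted above amount to a small configuration. The following Python dictionary is a hedged summary of only the figures stated in that row; the name DATA_SPLITS is illustrative and not taken from the paper or any released code, and the WMT14 En→De training size is not quoted, so it is left unset.

```python
# Hedged summary of the splits quoted in the "Dataset Splits" row.
# Only figures stated there are filled in; anything not reported stays None.
DATA_SPLITS = {
    "WMT14 En-De": {
        "train_pairs": None,  # training-set size not stated in the quoted row
        "valid": "4K pairs randomly sampled from the training set",
        "test": "newstest2014",
    },
    "WMT14 En-Fr": {
        "train_pairs": 36_000_000,
        "valid": "newstest2012 + newstest2013",
        "test": "newstest2014",
    },
    "WMT17 Zh-En": {
        "train_pairs": 22_000_000,
        "valid": "newsdev2017",
        "test": "newstest2017",
    },
}
```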
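The learning-rate formula in the Experiment Setup row is the standard inverse-square-root schedule of Vaswani et al. (2017). Below is a minimal Python sketch, assuming the base-model embedding dimension d = 512 and the 8,000 warmup steps quoted above; the function name is chosen here for illustration.

```python
def inverse_sqrt_lr(step: int, d_model: int = 512, warmup_steps: int = 8000) -> float:
    """lr = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)).

    The rate rises linearly for `warmup_steps` updates and then decays
    proportionally to the inverse square root of the step number.
    """
    step = max(step, 1)  # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: the rate peaks at the end of warmup, then decays.
for s in (1, 4_000, 8_000, 16_000, 64_000):
    print(f"step {s:>6}: lr = {inverse_sqrt_lr(s):.6f}")
```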
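For the Pseudocode and Experiment Setup rows, the following PyTorch-style sketch shows how an auxiliary alignment loss with the quoted hyperparameters (L_A = 4, w = 20K warmup steps, 20% sampling ratio, fixed γ2 = 0.1) might be wired into a standard NMT training step. The model/batch interface (return_attention, align_targets) and the KL form of the alignment loss are assumptions made for illustration; this is not the authors' Algorithm 1 or their implementation.

```python
import torch
import torch.nn.functional as F

# Hyperparameters quoted in the "Experiment Setup" row.
ALIGN_LAYER = 4        # L_A: decoder layer whose cross-attention supplies alignment weights
WARMUP_STEPS = 20_000  # w: plain NMT training steps before alignment supervision starts
SAMPLE_RATIO = 0.2     # fraction of target positions sampled for alignment supervision
GAMMA2 = 0.1           # fixed weight of the alignment loss

def training_step(model, batch, step):
    """One training step combining the NMT loss with an auxiliary alignment loss.

    `model(...)` is assumed to return token logits plus per-layer cross-attention
    weights of shape (batch, tgt_len, src_len); `batch.align_targets` is an assumed
    tensor of reference alignment distributions. Neither is the authors' actual API.
    """
    logits, cross_attn = model(batch.src, batch.tgt, return_attention=True)
    nmt_loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        batch.tgt_out.view(-1),
        label_smoothing=0.1,  # label-smoothing value quoted above
    )

    loss = nmt_loss
    if step > WARMUP_STEPS:
        # Supervise the cross-attention of decoder layer L_A on a sampled subset.
        attn = cross_attn[ALIGN_LAYER - 1]                              # (B, T, S)
        mask = torch.rand(attn.shape[:2], device=attn.device) < SAMPLE_RATIO
        if mask.any():
            align_loss = F.kl_div(
                attn[mask].clamp_min(1e-9).log(),  # predicted alignment (log-probs)
                batch.align_targets[mask],         # reference alignment distribution
                reduction="batchmean",
            )
            loss = nmt_loss + GAMMA2 * align_loss
    return loss
```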