Alignment-Enhanced Transformer for Constraining NMT with Pre-Specified Translations

Authors: Kai Song, Kun Wang, Heng Yu, Yue Zhang, Zhongqiang Huang, Weihua Luo, Xiangyu Duan, Min Zhang (pp. 8886–8893)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Results on five language pairs show that our method is highly effective in constraining NMT with pre-specified translations, consistently outperforming previous methods in translation quality.
Researcher Affiliation | Collaboration | Kai Song (1,2), Kun Wang (1), Heng Yu (2), Yue Zhang (3), Zhongqiang Huang (2), Weihua Luo (2), Xiangyu Duan (1), Min Zhang (1). Affiliations: 1 Soochow University, Suzhou, China; 2 Machine Intelligence Technology Lab, Alibaba Group, Hangzhou, China; 3 School of Engineering, Westlake University, Hangzhou, China.
Pseudocode | No | The paper describes its methods using text and figures but does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | No | The paper states 'We use an in-house re-implementation of Transformer (Vaswani et al. 2017), similar to Google's Tensor2Tensor', but it provides no statement or link indicating that the code for its own method has been open-sourced.
Open Datasets | Yes | Our training corpora are taken from the WMT news translation task. In particular, the training corpora of En-De and En-Ro are taken from WMT2014 and WMT2016, respectively. Training corpora for En-Ru, En-Fr and Ch-En are taken from WMT2018. To directly evaluate alignment extraction accuracy, we use two hand-aligned, publicly available alignment test sets for En-Ro [3] and En-De [4]. (Footnote 3: https://www-i6.informatik.rwth-aachen.de/goldAlignment/; Footnote 4: http://web.eecs.umich.edu/~mihalcea/wpt/index.html.) An illustrative AER scoring sketch for such hand-aligned test sets follows the table.
Dataset Splits | No | The paper mentions 'development and test sets' and refers to 'dev2016', 'dev2015', and 'dev2017' in its tables, but it does not provide the split details (exact percentages, sample counts, or splitting methodology) needed to reproduce the training/validation/test partitioning beyond the total training-set sizes.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions several software components and techniques (e.g., BPE, Adam, fast_align, GIZA++) but does not give version numbers for the key software dependencies (programming language, deep-learning framework such as TensorFlow or PyTorch, or specific library versions) that would be needed for replication. An illustrative version-logging sketch follows the table.
Experiment Setup | Yes | BPE (Sennrich, Haddow, and Birch 2015b) is used in all experiments, where the number of merge operations is set to 30K for En-Ru and Ch-En, and 50K for En-Ro, En-De and En-Fr. We use six self-attention layers for both the encoder and the decoder. The embedding size and the hidden size are set to 512. Eight heads are used for the multi-head self-attention architecture. The feed-forward layer has 2,048 cells and ReLU (Krizhevsky, Sutskever, and Hinton 2012) is used as the activation function. Adam (Kingma and Ba 2014) is used for training; warmup steps are set to 16,000; the learning rate is 0.0003. We use label smoothing (Junczys-Dowmunt, Dwojak, and Sennrich 2016) with a confidence score of 0.9, and all dropout (Gal and Ghahramani 2016) probabilities are set to 0.1. The vocabulary size is set to 30K for Ch-En and En-Ru, and 50K for En-Ro, En-De and En-Fr. The hidden size of the additional attention is set to 512. These values are collected into a hedged configuration sketch after the table.
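
The hand-aligned En-Ro and En-De test sets noted in the Open Datasets row are typically scored with Alignment Error Rate (AER). The following is a minimal sketch of that metric, not code from the paper; the function name and the representation of alignments as sets of (source index, target index) pairs are assumptions made for illustration.

```python
# Minimal AER sketch (Och & Ney 2003). Assumes alignments are sets of
# (source_index, target_index) pairs: "sure" and "possible" come from the
# hand-aligned gold files, "hypothesis" from the model being evaluated.

def alignment_error_rate(sure, possible, hypothesis):
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with S a subset of P."""
    possible = possible | sure  # by convention, sure links are also possible
    a_and_s = len(hypothesis & sure)
    a_and_p = len(hypothesis & possible)
    return 1.0 - (a_and_s + a_and_p) / (len(hypothesis) + len(sure))

# Toy usage: a hypothesis that matches all sure links gives AER = 0.0.
sure = {(0, 0), (1, 1)}
possible = {(1, 2)}
hyp = {(0, 0), (1, 1)}
print(alignment_error_rate(sure, possible, hyp))  # 0.0
```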
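
Because the Software Dependencies row notes that no versions are reported, a replication would need to pin them independently. Below is a hedged sketch of how one might record the versions of the tools named in the paper; the package name subword-nmt (the BPE implementation of Sennrich et al.) and the binary names fast_align and GIZA++ on PATH are assumptions, not details given in the paper.

```python
# Hedged sketch: log the tool versions actually used, since the paper does not
# pin any. Package and binary names below are assumptions for illustration.
import importlib.metadata as md
import shutil
import sys

def report_environment():
    print("python:", sys.version.split()[0])
    for pkg in ("subword-nmt",):  # BPE implementation (assumed package name)
        try:
            print(f"{pkg}:", md.version(pkg))
        except md.PackageNotFoundError:
            print(f"{pkg}: not installed")
    # fast_align and GIZA++ are compiled tools; record whether they are on PATH.
    for binary in ("fast_align", "GIZA++"):
        print(f"{binary}:", shutil.which(binary) or "not found on PATH")

if __name__ == "__main__":
    report_environment()
```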
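
The hyperparameters quoted in the Experiment Setup row correspond to a base-sized Transformer. The sketch below simply collects those reported values into a Python configuration object; it is not the authors' in-house implementation, and all field names are illustrative.

```python
# Hedged sketch: the hyperparameters reported in the paper collected into a
# single config object. Field names are illustrative, not the authors' code.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ExperimentConfig:
    # Transformer architecture (Vaswani et al. 2017, base-sized)
    encoder_layers: int = 6
    decoder_layers: int = 6
    embedding_size: int = 512
    hidden_size: int = 512
    attention_heads: int = 8
    feed_forward_size: int = 2048
    activation: str = "relu"
    # Additional attention over the pre-specified translations
    extra_attention_hidden_size: int = 512
    # Optimization
    optimizer: str = "adam"
    learning_rate: float = 3e-4
    warmup_steps: int = 16_000
    label_smoothing_confidence: float = 0.9
    dropout: float = 0.1
    # Per-language-pair data settings (merge operations and vocabulary sizes)
    bpe_merge_ops: Dict[str, int] = field(default_factory=lambda: {
        "en-ru": 30_000, "zh-en": 30_000,
        "en-ro": 50_000, "en-de": 50_000, "en-fr": 50_000,
    })
    vocab_size: Dict[str, int] = field(default_factory=lambda: {
        "zh-en": 30_000, "en-ru": 30_000,
        "en-ro": 50_000, "en-de": 50_000, "en-fr": 50_000,
    })

config = ExperimentConfig()
print(config.warmup_steps, config.learning_rate)  # 16000 0.0003
```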