Generating Diverse Translation by Manipulating Multi-Head Attention

Authors: Zewei Sun, Shujian Huang, Hao-Ran Wei, Xin-yu Dai, Jiajun Chen

AAAI 2020, pp. 8976-8983 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results show that our method generates diverse translations without a severe drop in translation quality. Experiments also show that back-translation with these diverse translations could bring a significant improvement in performance on translation tasks. An auxiliary experiment on the conversation response generation task proves the effect of diversity as well.
Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; sunzw@smail.nju.edu.cn, whr94621@foxmail.com, {huangsj, daixinyu, chenjj}@nju.edu.cn
Pseudocode | Yes | Algorithm 1: Sample Policy
Open Source Code | No | The paper links to a TensorFlow Transformer implementation (https://github.com/tensorflow/tensor2tensor/blob/v1.3.0/tensor2tensor/models/transformer.py), but this is a third-party library, not the authors' specific implementation of their method. There is no explicit statement about releasing their own source code.
Open Datasets | Yes | NIST Chinese-to-English (NIST Zh-En). The training data consists of 1.34 million sentence pairs extracted from the LDC corpus. We use MT03 as the development set, MT04, MT05, MT06 as the test sets.
Dataset Splits | Yes | NIST Chinese-to-English (NIST Zh-En)... We use MT03 as the development set... WMT14 English-to-German (WMT En-De)... We use newstest2013 as the development set... WMT16 English-to-Romanian (WMT En-Ro)... We use newstest2015 as the development set... IWSLT17 Chinese-to-English (IWSLT Zh-En)... We use dev2010 and tst2010 as the development set... Short Text Conversation (STC)... The development set is made up similarly. (See the split sketch after this table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only references model settings and optimizers.
Software Dependencies | No | The paper states, "we follow the Transformer base v1 settings", and provides a GitHub link to a TensorFlow file. However, it does not specify version numbers for TensorFlow or any other software libraries required for reproducibility.
Experiment Setup | Yes | Without extra statement, we follow the Transformer base v1 settings, with 6 layers in the encoder and 2 layers in the decoder, 512 hidden units, 8 heads in multi-head attention, and 2048 hidden units in the feed-forward layers. Parameters are optimized using the Adam optimizer (Kingma and Ba 2015), with β1 = 0.9, β2 = 0.98, and ϵ = 10^-9. The learning rate is scheduled according to the method proposed in Vaswani et al. (2017), with warmup steps = 8000. Label smoothing (Szegedy et al. 2016) of value 0.1 is also adopted. (A configuration sketch follows this table.)
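
For the Dataset Splits row above, the following is a minimal sketch, in Python, of the development/test assignments quoted from the paper. The names dev_sets and test_sets are illustrative and not taken from the authors' code; test sets that are not quoted in the excerpt are deliberately left out rather than guessed.

# Development and test splits as quoted in the paper (illustrative names).
dev_sets = {
    "NIST Zh-En": ["MT03"],
    "WMT14 En-De": ["newstest2013"],
    "WMT16 En-Ro": ["newstest2015"],
    "IWSLT17 Zh-En": ["dev2010", "tst2010"],
}

test_sets = {
    "NIST Zh-En": ["MT04", "MT05", "MT06"],
    # Test sets for the other language pairs are not quoted in this excerpt.
}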
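
For the Experiment Setup row above, here is a minimal sketch of the reported Transformer base v1 hyperparameters and the Vaswani et al. (2017) learning-rate schedule, assuming a plain Python representation; the dictionary keys and the function name noam_learning_rate are illustrative, not flags from the authors' code or from tensor2tensor.

# Reported hyperparameters (illustrative key names).
transformer_base_v1 = {
    "encoder_layers": 6,
    "decoder_layers": 2,
    "hidden_size": 512,
    "num_heads": 8,
    "ffn_size": 2048,        # hidden units in the feed-forward layers
    "label_smoothing": 0.1,
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "adam_epsilon": 1e-9,
    "warmup_steps": 8000,
}

def noam_learning_rate(step, d_model=512, warmup_steps=8000):
    """Schedule from Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

With these values the schedule rises linearly to roughly 4.9e-4 at step 8000 and then decays proportionally to step^-0.5.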