Generating Diverse Translation by Manipulating Multi-Head Attention
Authors: Zewei Sun, Shujian Huang, Hao-Ran Wei, Xin-yu Dai, Jiajun Chen
AAAI 2020, pp. 8976-8983 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results show that our method generates diverse translations without a severe drop in translation quality. Experiments also show that back-translation with these diverse translations could bring a significant improvement in performance on translation tasks. An auxiliary experiment of conversation response generation task proves the effect of diversity as well. |
| Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. sunzw@smail.nju.edu.cn, whr94621@foxmail.com, {huangsj, daixinyu, chenjj}@nju.edu.cn |
| Pseudocode | Yes | Algorithm 1 Sample Policy (a hedged sketch of a head-sampling policy appears after this table) |
| Open Source Code | No | The paper links to a TensorFlow Transformer implementation: "https://github.com/tensorflow/tensor2tensor/blob/v1.3.0/tensor2tensor/models/transformer.py", but this is a third-party library, not the authors' own implementation of their method. There is no explicit statement about releasing their own source code. |
| Open Datasets | Yes | NIST Chinese-to-English (NIST Zh-En). The training data consists of 1.34 million sentence pairs extracted from LDC corpus. We use MT03 as the development set, MT04, MT05, MT06 as the test sets. |
| Dataset Splits | Yes | NIST Chinese-to-English (NIST Zh-En)... We use MT03 as the development set... WMT14 English-to-German (WMT En-De)... We use newstest2013 as the development set... WMT16 English-to-Romanian (WMT En-Ro)... We use newstest2015 as the development set... IWSLT17 Chinese-to-English (IWSLT Zh-En)... We used dev2010 and tst2010 as the development set... Short Text Conversation (STC)... The develop set is made up similarly. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only references model settings and optimizers. |
| Software Dependencies | No | The paper states, "we follow the Transformer base v1 settings" and provides a GitHub link to a TensorFlow file. However, it does not specify the TensorFlow version or any other required software libraries with version numbers for reproducibility. |
| Experiment Setup | Yes | Without extra statement, we follow the Transformer base v1 settings, with 6 layers in the encoder and 2 layers in the decoder, 512 hidden units, 8 heads in multi-head attention, and 2048 hidden units in the feed-forward layers. Parameters are optimized using the Adam optimizer (Kingma and Ba 2015), with β1 = 0.9, β2 = 0.98, and ϵ = 10⁻⁹. The learning rate is scheduled according to the method proposed in Vaswani et al. (2017), with warmup steps = 8000. Label smoothing (Szegedy et al. 2016) of value = 0.1 is also adopted. (See the learning-rate sketch after this table.) |
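The Pseudocode row above names Algorithm 1 "Sample Policy" but does not reproduce it. Below is a minimal, hypothetical Python sketch of one way such a policy could look, assuming (this is my assumption, not the paper's stated procedure) that the policy uniformly samples decoder attention heads and suppresses them for one decoding pass to obtain a diverse hypothesis. The names `sample_head_mask`, `mask_heads`, and `decode` are invented for illustration.

```python
import random
import torch

def sample_head_mask(num_layers: int, num_heads: int, k: int = 1, seed=None) -> set:
    """Hypothetical sample policy: draw k (layer, head) pairs uniformly at random.
    The chosen heads are suppressed during one decoding pass."""
    rng = random.Random(seed)
    pairs = [(l, h) for l in range(num_layers) for h in range(num_heads)]
    return set(rng.sample(pairs, k))

def mask_heads(attn_output: torch.Tensor, layer: int, masked: set) -> torch.Tensor:
    """Zero out the per-head attention output of the sampled heads in one layer.
    attn_output has shape [batch, num_heads, tgt_len, head_dim]."""
    for (l, h) in masked:
        if l == layer:
            attn_output[:, h] = 0.0
    return attn_output

# Usage sketch: re-decode the same source with a freshly sampled mask each time
# (`decode` stands in for any beam-search routine that accepts a head mask).
# hypotheses = [decode(src, head_mask=sample_head_mask(num_layers=2, num_heads=8, seed=i))
#               for i in range(4)]
```

The 2-layer, 8-head values in the usage comment mirror the decoder configuration reported in the Experiment Setup row; they are placeholders, not prescriptions.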
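The Experiment Setup row quotes the optimizer and learning-rate settings but not the schedule itself. As a sanity check, here is a short Python sketch of the schedule from Vaswani et al. (2017) with the reported warmup steps, alongside the Adam and label-smoothing values from the row above; the constant names are my own.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 8000) -> float:
    """Learning-rate schedule from Vaswani et al. (2017):
    lr = d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Hyper-parameter values reported in the Experiment Setup row.
ADAM_BETAS = (0.9, 0.98)
ADAM_EPSILON = 1e-9
LABEL_SMOOTHING = 0.1

# The rate warms up linearly for 8000 steps, then decays as 1/sqrt(step):
for s in (100, 8000, 80000):
    print(s, round(transformer_lr(s), 7))
```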