Monotonic Multihead Attention
Authors: Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, Jiatao Gu
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs compared to MILk, the previous state-of-the-art approach. We analyze how the latency controls affect the attention span and we study the relationship between the speed of a head and the layer it belongs to. |
| Researcher Affiliation | Collaboration | ¹Facebook, ²Johns Hopkins University |
| Pseudocode | Yes | Algorithm 1: MMA monotonic decoding (see the decoding sketch after this table). |
| Open Source Code | Yes | The code is available at https://github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation |
| Open Datasets | Yes | IWSLT15 En-Vi: 133k train / 1268 validation / 1553 test; WMT15 De-En: 4.5M train / 3000 validation / 2169 test |
| Dataset Splits | Yes | IWSLT15 En-Vi: 133k train / 1268 validation / 1553 test; WMT15 De-En: 4.5M train / 3000 validation / 2169 test |
| Hardware Specification | No | The paper does not explicitly describe the hardware specifications (e.g., GPU model, CPU, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Fairseq library (Ott et al., 2019)' but does not provide specific version numbers for Fairseq or any other software dependencies. |
| Experiment Setup | Yes | Detailed hyperparameter settings can be found in subsection A.1 (restated as Python dicts after this table). Hyperparameters (WMT15 German-English / IWSLT English-Vietnamese): encoder embed dim 1024 / 512; encoder ffn embed dim 4096 / 1024; encoder attention heads 16 / 4; encoder layers 6; decoder embed dim 1024 / 512; decoder ffn embed dim 4096 / 1024; decoder attention heads 16 / 4; decoder layers 6; dropout 0.3; optimizer adam; adam-β (0.9, 0.98); clip-norm 0.0; lr 0.0005; lr scheduler inverse sqrt; warmup-updates 4000; warmup-init-lr 1e-07; label-smoothing 0.1; max tokens 3584 8 8 2 16000 |
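
To make the cited pseudocode concrete, below is a minimal, self-contained sketch of greedy MMA monotonic decoding in the spirit of Algorithm 1. It is not the paper's implementation: `ToyMMAModel`, `energy_fn`, and `write_fn` are hypothetical stand-ins for the monotonic stopping energy and the decoder write step, and the read/write policy is reduced to its core loop (each head advances until its stopping probability p ≥ 0.5; a target token is written only after every head in every layer has stopped).

```python
# Toy sketch of MMA monotonic decoding (in the spirit of Algorithm 1).
# Everything named here is a placeholder for illustration, not a fairseq API.
import math
from dataclasses import dataclass
from typing import Callable, List

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

@dataclass
class ToyMMAModel:
    """Stand-in for an MMA decoder: only the pieces the decoding loop needs."""
    num_layers: int
    num_heads: int
    # stopping energy for head (layer, head) at target step i, source position j
    energy_fn: Callable[[int, int, int, int], float]
    # emits the next target token given target prefix, read source prefix, head positions
    write_fn: Callable[[List[str], List[str], List[List[int]]], str]

def mma_greedy_decode(model: ToyMMAModel, source: List[str], max_len: int = 50) -> List[str]:
    """Greedy simultaneous decoding: each monotonic head keeps its own read position,
    and a target token is written only once every head has stopped (p >= 0.5)."""
    read: List[str] = []                                      # source tokens read so far
    pos = [[0] * model.num_heads for _ in range(model.num_layers)]
    target: List[str] = []
    step = 0

    while (not target or target[-1] != "</s>") and len(target) < max_len:
        # READ phase: head (l, h) advances until it decides to stop or the source ends.
        for l in range(model.num_layers):
            for h in range(model.num_heads):
                while True:
                    if pos[l][h] >= len(read):
                        if len(read) < len(source):
                            read.append(source[len(read)])    # read one more source token
                        else:
                            break                             # source exhausted: stop here
                    p = sigmoid(model.energy_fn(l, h, step, pos[l][h]))
                    if p >= 0.5:
                        break                                 # head stops at pos[l][h]
                    pos[l][h] += 1
        # WRITE phase: with all head positions fixed, each head would soft-attend over
        # the prefix it has read (MMA-IL) and the decoder emits the next target token.
        target.append(model.write_fn(target, read, pos))
        step += 1
    return target

# Trivial stand-ins: each head stops two source positions further at every target step,
# and the "decoder" simply copies the source prefix it has seen.
toy = ToyMMAModel(
    num_layers=2,
    num_heads=2,
    energy_fn=lambda l, h, i, j: 5.0 if j >= 2 * (i + 1) - 1 else -5.0,
    write_fn=lambda tgt, read, pos: read[len(tgt)] if len(tgt) < len(read) else "</s>",
)
print(mma_greedy_decode(toy, "wir sehen uns morgen wieder .".split()))
```

The per-(layer, head) positions in `pos` correspond to the independently advancing monotonic heads that distinguish MMA from single-head monotonic attention approaches such as MILk.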
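
For convenience, the flattened hyperparameter table in the Experiment Setup row can also be restated as plain Python dictionaries. This is only a transcription of the reported values, not a fairseq configuration object; the keys are informal labels chosen here, and the ambiguous trailing "max tokens 3584 8 8 2 16000" entry is reproduced in a comment rather than guessed at.

```python
# Transcription of the reported hyperparameters (subsection A.1 of the paper) into
# plain Python dicts. Keys are informal labels chosen here, not fairseq option names.
SHARED = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "dropout": 0.3,
    "optimizer": "adam",
    "adam_betas": (0.9, 0.98),
    "clip_norm": 0.0,
    "lr": 0.0005,
    "lr_scheduler": "inverse_sqrt",
    "warmup_updates": 4000,
    "warmup_init_lr": 1e-07,
    "label_smoothing": 0.1,
    # "max tokens" is reported as "3584 8 8 2 16000"; the per-dataset split is not
    # recoverable from the extraction, so it is left out rather than guessed.
}

WMT15_DE_EN = dict(
    SHARED,
    encoder_embed_dim=1024, encoder_ffn_embed_dim=4096, encoder_attention_heads=16,
    decoder_embed_dim=1024, decoder_ffn_embed_dim=4096, decoder_attention_heads=16,
)

IWSLT15_EN_VI = dict(
    SHARED,
    encoder_embed_dim=512, encoder_ffn_embed_dim=1024, encoder_attention_heads=4,
    decoder_embed_dim=512, decoder_ffn_embed_dim=1024, decoder_attention_heads=4,
)
```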