Monotonic Multihead Attention

Authors: Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, Jiatao Gu

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs compared to MILk, the previous state-of-the-art approach. We analyze how the latency controls affect the attention span and we study the relationship between the speed of a head and the layer it belongs to.
Researcher Affiliation | Collaboration | Facebook; Johns Hopkins University
Pseudocode | Yes | Algorithm 1: MMA monotonic decoding (an illustrative decoding sketch is given after the table).
Open Source Code | Yes | The code is available at https://github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation
Open Datasets | Yes | IWSLT15 En-Vi: 133k train / 1268 validation / 1553 test; WMT15 De-En: 4.5M train / 3000 validation / 2169 test
Dataset Splits | Yes | IWSLT15 En-Vi: 133k train / 1268 validation / 1553 test; WMT15 De-En: 4.5M train / 3000 validation / 2169 test
Hardware Specification | No | The paper does not explicitly describe the hardware specifications (e.g., GPU model, CPU, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using the Fairseq library (Ott et al., 2019) but does not provide specific version numbers for Fairseq or any other software dependencies.
Experiment Setup | Yes | Detailed hyperparameter settings can be found in subsection A.1. WMT15 German-English / IWSLT English-Vietnamese: encoder embed dim 1024 / 512; encoder ffn embed dim 4096 / 1024; encoder attention heads 16 / 4; encoder layers 6; decoder embed dim 1024 / 512; decoder ffn embed dim 4096 / 1024; decoder attention heads 16 / 4; decoder layers 6; dropout 0.3; optimizer adam; adam-β (0.9, 0.98); clip-norm 0.0; lr 0.0005; lr scheduler inverse sqrt; warmup-updates 4000; warmup-init-lr 1e-07; label-smoothing 0.1; max tokens 3584 × 8 × 8 / 16000 × 2
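
For reference, the per-dataset settings above can be collected into a small configuration mapping. The sketch below is only an illustrative consolidation: the key names are generic rather than Fairseq's exact flag names, shared values are factored into a common dict, and the max-tokens setting is left out.

```python
# Illustrative consolidation of the reported training hyperparameters.
# Key names are generic, not necessarily Fairseq's exact flag names.
COMMON = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "dropout": 0.3,
    "optimizer": "adam",
    "adam_betas": (0.9, 0.98),
    "clip_norm": 0.0,
    "lr": 5e-4,
    "lr_scheduler": "inverse_sqrt",
    "warmup_updates": 4000,
    "warmup_init_lr": 1e-7,
    "label_smoothing": 0.1,
}

CONFIGS = {
    "WMT15 De-En": {
        **COMMON,
        "encoder_embed_dim": 1024,
        "encoder_ffn_embed_dim": 4096,
        "encoder_attention_heads": 16,
        "decoder_embed_dim": 1024,
        "decoder_ffn_embed_dim": 4096,
        "decoder_attention_heads": 16,
    },
    "IWSLT15 En-Vi": {
        **COMMON,
        "encoder_embed_dim": 512,
        "encoder_ffn_embed_dim": 1024,
        "encoder_attention_heads": 4,
        "decoder_embed_dim": 512,
        "decoder_ffn_embed_dim": 1024,
        "decoder_attention_heads": 4,
    },
}
```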
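
To make the pseudocode row concrete, the following is a minimal, hypothetical sketch of hard monotonic decoding with several attention heads in the spirit of Algorithm 1. The stepwise probability p_choose, the decoder call, and the output tokens are placeholders standing in for a trained MMA model, not the paper's implementation.

```python
import random

random.seed(0)


def p_choose(head, src_pos, tgt_pos):
    """Placeholder for the learned stepwise probability (sigmoid of an
    energy) that head `head` stops reading at source position `src_pos`
    while producing target position `tgt_pos`."""
    return random.random()


def mma_hard_decode(src_len, num_heads, max_tgt_len):
    # Each head keeps its own monotonic read pointer into the source;
    # pointers persist across target steps and never move backwards.
    read_ptr = [0] * num_heads
    outputs = []
    for i in range(max_tgt_len):
        # Every head independently reads further source tokens until it
        # decides to stop (p_choose >= 0.5) or the source is exhausted.
        for h in range(num_heads):
            while read_ptr[h] < src_len - 1 and p_choose(h, read_ptr[h], i) < 0.5:
                read_ptr[h] += 1
        # A target token is written only after all heads have halted, so
        # the number of source tokens consumed is set by the furthest head.
        consumed = max(read_ptr) + 1
        outputs.append((f"<tok_{i}>", consumed))  # stand-in for the real decoder output
    return outputs


if __name__ == "__main__":
    print(mma_hard_decode(src_len=10, num_heads=4, max_tgt_len=5))
```

Each head advances its own read pointer monotonically; a target token is emitted only once every head has halted, so the amount of source context consumed at each step is determined by the head that has read the furthest.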