Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Monotonic Multihead Attention

Authors: Xutai Ma, Juan Miguel Pino, James Cross, Liezl Puzon, Jiatao Gu

ICLR 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We apply MMA to the simultaneous machine translation task and demonstrate better latency-quality tradeoffs compared to MILk, the previous state-of-the-art approach. We analyze how the latency controls affect the attention span and we study the relationship between the speed of a head and the layer it belongs to." |
| Researcher Affiliation | Collaboration | 1 Facebook, 2 Johns Hopkins University |
| Pseudocode | Yes | "Algorithm 1 MMA monotonic decoding." |
| Open Source Code | Yes | "The code is available at https://github.com/pytorch/fairseq/tree/master/examples/simultaneous_translation" |
| Open Datasets | Yes | IWSLT15 En-Vi: 133k train / 1268 validation / 1553 test; WMT15 De-En: 4.5M train / 3000 validation / 2169 test |
| Dataset Splits | Yes | Same split table: IWSLT15 En-Vi 133k / 1268 / 1553, WMT15 De-En 4.5M / 3000 / 2169 (train / validation / test) |
| Hardware Specification | No | The paper does not explicitly describe the hardware specifications (e.g., GPU model, CPU, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the "Fairseq library (Ott et al., 2019)" but does not provide specific version numbers for Fairseq or any other software dependencies. |
| Experiment Setup | Yes | "Detailed hyperparameter settings can be found in subsection A.1." (reproduced below) |

Hyperparameter settings (subsection A.1 of the paper):

| Hyperparameter | WMT15 German-English | IWSLT English-Vietnamese |
|---|---|---|
| encoder embed dim | 1024 | 512 |
| encoder ffn embed dim | 4096 | 1024 |
| encoder attention heads | 16 | 4 |
| encoder layers | 6 | 6 |
| decoder embed dim | 1024 | 512 |
| decoder ffn embed dim | 4096 | 1024 |
| decoder attention heads | 16 | 4 |
| decoder layers | 6 | 6 |
| dropout | 0.3 | 0.3 |
| optimizer | adam | adam |
| adam-β | (0.9, 0.98) | (0.9, 0.98) |
| clip-norm | 0.0 | 0.0 |
| lr | 0.0005 | 0.0005 |
| lr scheduler | inverse sqrt | inverse sqrt |
| warmup-updates | 4000 | 4000 |
| warmup-init-lr | 1e-07 | 1e-07 |
| label-smoothing | 0.1 | 0.1 |
| max tokens | 3584 × 8 × 8 | 2 × 16000 |
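The Pseudocode row refers to the paper's "Algorithm 1 MMA monotonic decoding", in which each monotonic attention head reads source tokens independently and a target token is written only once every head has stopped. A minimal toy sketch of that read/write policy (not the fairseq implementation; array shapes and the 0.5 selection threshold follow the usual hard monotonic inference convention):

```python
import numpy as np

def mma_step(p_choose, positions):
    """One write step of hard monotonic multihead decoding (toy sketch).

    p_choose : (num_heads, src_len) array of stepwise selection
               probabilities, one row per monotonic attention head.
    positions: list of current source indices, one per head.

    Each head READs forward while its selection probability is below
    0.5 (or until the source is exhausted); the decoder WRITEs the
    next target token only after every head has stopped, so latency
    is governed by the slowest head.
    """
    src_len = p_choose.shape[1]
    new_positions = []
    for h, pos in enumerate(positions):
        # READ: advance this head while it declines to select.
        while pos < src_len - 1 and p_choose[h, pos] < 0.5:
            pos += 1
        new_positions.append(pos)
    return new_positions  # WRITE happens once all heads have stopped

# Example: head 0 skips index 0 and selects index 1; head 1 selects index 0.
p = np.array([[0.1, 0.9, 0.2, 0.8],
              [0.6, 0.1, 0.1, 0.7]])
print(mma_step(p, [0, 0]))  # → [1, 0]
```

Head positions are non-decreasing across steps, which is what makes the attention monotonic and suitable for streaming input.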
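The hyperparameter rows map closely onto standard fairseq CLI flags. As a hedged illustration only, the WMT15 De-En configuration might be invoked roughly as follows (the data path is a placeholder, and the actual runs in the linked simultaneous_translation examples use additional task- and architecture-specific flags not listed in the table):

```shell
# Hypothetical sketch of the WMT15 De-En setup; data-bin path is a placeholder.
fairseq-train data-bin/wmt15_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 --warmup-init-lr 1e-07 \
    --dropout 0.3 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --encoder-embed-dim 1024 --encoder-ffn-embed-dim 4096 \
    --encoder-attention-heads 16 --encoder-layers 6 \
    --decoder-embed-dim 1024 --decoder-ffn-embed-dim 4096 \
    --decoder-attention-heads 16 --decoder-layers 6 \
    --max-tokens 3584
```

The IWSLT En-Vi column would swap in the smaller dimensions (512/1024, 4 heads) and its own max-tokens budget.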