Hidden Markov Transformer for Simultaneous Machine Translation

Authors: Shaolei Zhang, Yang Feng

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on multiple SiMT benchmarks show that HMT outperforms strong baselines and achieves state-of-the-art performance.
Researcher Affiliation | Academia | 1 Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); 2 University of Chinese Academy of Sciences, Beijing, China
Pseudocode | Yes | Algorithm 1: Inference Policy of Hidden Markov Transformer
Open Source Code | Yes | Code is available at https://github.com/ictnlp/HMT
Open Datasets | Yes | IWSLT15 English→Vietnamese (En→Vi) ... We use TED tst2012 (1553 pairs) as the validation set and TED tst2013 (1268 pairs) as the test set. Following the previous setting (Ma et al., 2020; Zhang & Feng, 2021c) ... WMT15 German→English (De→En) (4.5M pairs): We use newstest2013 (3000 pairs) as the validation set and newstest2015 (2169 pairs) as the test set. BPE (Sennrich et al., 2016) is applied with 32K merge operations and the vocabulary of German and English is shared. [Footnote 4: nlp.stanford.edu/projects/nmt/; Footnote 5: statmt.org/wmt15/translation-task.html] (A hedged BPE preprocessing sketch follows the table.)
Dataset Splits | Yes | IWSLT15 English→Vietnamese (En→Vi) ... We use TED tst2012 (1553 pairs) as the validation set and TED tst2013 (1268 pairs) as the test set. ... WMT15 German→English (De→En) ... We use newstest2013 (3000 pairs) as the validation set and newstest2015 (2169 pairs) as the test set.
Hardware Specification | Yes | All speeds are evaluated on NVIDIA 3090 GPU.
Software Dependencies | No | The paper states, 'All systems are based on Transformer (Vaswani et al., 2017) from Fairseq Library (Ott et al., 2019).' However, it does not provide specific version numbers for Fairseq or any other software libraries.
Experiment Setup | Yes | Appendix D, Table 7 provides detailed hyperparameter settings for HMT, including encoder/decoder layers, attention heads, embed/ffn dimensions, dropout, optimizer (adam with beta values), learning rate, scheduler, warmup updates, weight decay, label smoothing, and max tokens for different Transformer sizes (Small, Base, Big) and datasets. (A hedged configuration sketch follows below.)
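
The quoted De→En preprocessing (32K BPE merges with a vocabulary shared across German and English) can be approximated as follows. This is a minimal sketch, assuming the subword-nmt Python API (learn_bpe / apply_bpe); all file paths (train.de-en.cat, codes.de-en.bpe, train.de, train.en) are hypothetical placeholders, and train.de-en.cat stands for the concatenation of both sides of the training corpus.

```python
# Hedged sketch: learn a joint 32K BPE model on the concatenated De+En training text
# so the vocabulary is shared, then segment each side. Paths are placeholders.
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

NUM_MERGES = 32_000  # "32K merge operations" from the quoted setup

# Learn BPE codes on the concatenation of both language sides.
with open("train.de-en.cat", encoding="utf-8") as train_cat, \
     open("codes.de-en.bpe", "w", encoding="utf-8") as codes_out:
    learn_bpe(train_cat, codes_out, NUM_MERGES)

# Apply the learned codes to each side of the training corpus.
with open("codes.de-en.bpe", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)

for lang in ("de", "en"):
    with open(f"train.{lang}", encoding="utf-8") as src, \
         open(f"train.bpe.{lang}", "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(bpe.process_line(line))
```

In a fairseq-based reproduction, the BPE-segmented files would then typically be binarized with fairseq-preprocess, using its --joined-dictionary option to keep the German/English vocabulary shared, consistent with the quoted setup.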
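
For the Experiment Setup row, the fields reportedly listed in Appendix D, Table 7 map naturally onto a training configuration. The sketch below only enumerates those fields; the values shown are common Transformer-Base placeholders, not the paper's Table 7 numbers, which vary with model size (Small, Base, Big) and dataset.

```python
# Hedged sketch: the hyperparameter fields named in the Experiment Setup row, expressed as a
# config dict. Values are generic Transformer-Base placeholders, NOT the paper's Table 7 values.
config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "attention_heads": 8,
    "embed_dim": 512,
    "ffn_embed_dim": 2048,
    "dropout": 0.3,
    "optimizer": "adam",
    "adam_betas": (0.9, 0.98),
    "lr": 5e-4,
    "lr_scheduler": "inverse_sqrt",
    "warmup_updates": 4000,
    "weight_decay": 0.0001,
    "label_smoothing": 0.1,
    "max_tokens": 4096,  # batch size measured in tokens
}
```

Each field corresponds to a standard fairseq-train option (e.g. --encoder-layers, --lr-scheduler inverse_sqrt, --label-smoothing, --max-tokens), so a reproduction would substitute the actual Table 7 values for the placeholders above.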