Mixture Models for Diverse Machine Translation: Tricks of the Trade

Authors: Tianxiao Shen, Myle Ott, Michael Auli, Marc’Aurelio Ranzato

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our analysis shows that certain types of mixture models are more robust and offer the best trade-off between translation quality and diversity compared to variational models and diverse decoding approaches.
Researcher Affiliation | Collaboration | MIT CSAIL and Facebook AI Research.
Pseudocode | No | The paper describes algorithms (e.g., the EM algorithm) but does not provide them in a structured pseudocode or algorithm block (a minimal illustrative sketch follows the table).
Open Source Code | Yes | Code to reproduce the results in this paper is available at https://github.com/pytorch/fairseq.
Open Datasets | Yes | Datasets: We test mixture models and baselines on three benchmark datasets that uniquely provide multiple human references (Ott et al., 2018a; Hassan et al., 2018). WMT 17 English-German (En-De): We train on all available bitext and filter sentence pairs that have source or target longer than 80 words, resulting in 4.5M sentence pairs. WMT 14 English-French (En-Fr): We borrow the setup of Gehring et al. (2017) with 36M training sentence pairs and a 40K joint BPE vocabulary. WMT 17 Chinese-English (Zh-En): We pre-process the training data following Hassan et al. (2018), which results in 20M sentence pairs.
Dataset Splits | Yes | WMT 17 English-German (En-De): We develop on newstest2013 and test on a 500 sentence subset of newstest2014 that has 10 reference translations (Ott et al., 2018a). WMT 14 English-French (En-Fr): We validate on newstest2012+2013, and test on a 500 sentence subset of newstest2014 with 10 reference translations (Ott et al., 2018a). WMT 17 Chinese-English (Zh-En): We develop on devtest2017 and report results on newstest2017 with 3 reference translations.
Hardware Specification | Yes | We run experiments on between 8 and 128 Nvidia V100 GPUs with mini-batches of approximately 25K and 400K tokens for the experiments of Sections 5.1 and 5.2, respectively, following Ott et al. (2018b).
Software Dependencies | No | The paper mentions software like Fairseq (Ott et al., 2019), the Moses tokenizer (Koehn et al., 2007), and the Adam algorithm (Kingma & Ba, 2015). While Fairseq is cited, implying a version associated with that publication, specific version numbers (e.g., vX.Y) for Fairseq or other libraries like PyTorch or CUDA are not explicitly stated within the text.
Experiment Setup | Yes | The encoder and decoder have 6 blocks. The number of attention heads, embedding dimension and inner-layer dimension are 8, 512, 2048 for the base configuration and 16, 1024, 4096 for the big configuration, respectively. Models are optimized with the Adam algorithm (Kingma & Ba, 2015) using β1 = 0.9, β2 = 0.98, and ε = 1e-8. We use the same learning rate schedule as Ott et al. (2018b). We run experiments on between 8 and 128 Nvidia V100 GPUs with mini-batches of approximately 25K and 400K tokens for the experiments of Sections 5.1 and 5.2, respectively.
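
As noted in the Pseudocode row, the paper describes an EM-style training procedure for its mixture models in prose rather than in an algorithm block. Below is a minimal, illustrative sketch of a hard-EM update of the kind the paper discusses; the `model(src, tgt, expert=z)` interface returning per-sentence losses, the batch handling, and all names are assumptions made for this example, not the authors' implementation.

```python
# Minimal hard-EM sketch (illustrative; not the authors' code).
import torch

def hard_em_step(model, src_tokens, tgt_tokens, num_experts, optimizer):
    """One training step of a hard-mixture model: the E-step picks, for each
    sentence pair, the expert with the lowest loss; the M-step updates the
    model using only that expert's assignment."""
    # E-step: score every expert without tracking gradients.
    with torch.no_grad():
        losses = torch.stack([
            model(src_tokens, tgt_tokens, expert=z)   # assumed: per-sentence NLL, shape (batch,)
            for z in range(num_experts)
        ])                                            # shape (num_experts, batch)
        best_expert = losses.argmin(dim=0)            # hard responsibility per sentence

    # M-step: recompute the loss for the selected expert with gradients on.
    optimizer.zero_grad()
    loss = torch.stack([
        model(src_tokens[i:i + 1], tgt_tokens[i:i + 1], expert=int(z))
        for i, z in enumerate(best_expert)
    ]).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's terms this corresponds to a hard assignment of the latent variable z; a soft-EM variant would replace the argmin with responsibilities computed from the per-expert losses.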
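
The Open Datasets row mentions that the WMT 17 En-De bitext is filtered to remove sentence pairs whose source or target is longer than 80 words. The following is a simple sketch of such a length filter; the file-path arguments and whitespace tokenization are assumptions for illustration, not the authors' preprocessing pipeline.

```python
# Illustrative length filter for parallel text (assumed file layout: one
# sentence per line, source and target files aligned line by line).

def filter_bitext(src_path, tgt_path, out_src_path, out_tgt_path, max_len=80):
    """Keep only sentence pairs where both sides have at most `max_len` words."""
    kept = 0
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_src_path, "w", encoding="utf-8") as out_src, \
         open(out_tgt_path, "w", encoding="utf-8") as out_tgt:
        for src, tgt in zip(fs, ft):
            if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
                out_src.write(src)
                out_tgt.write(tgt)
                kept += 1
    return kept
```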
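
The Experiment Setup row lists the Transformer base/big dimensions and the Adam hyperparameters. The sketch below records those settings as a plain PyTorch configuration; the peak learning rate and the inverse-square-root warmup schedule are common choices assumed here for concreteness (the paper defers to the schedule of Ott et al., 2018b) and are not taken from the authors' code.

```python
# Reported Transformer configurations and Adam settings, written out as a
# plain PyTorch sketch (learning-rate values and schedule form are assumed).
import torch

TRANSFORMER_CONFIGS = {
    # 6 encoder and 6 decoder blocks in both configurations.
    "base": dict(heads=8, embed_dim=512, ffn_dim=2048, layers=6),
    "big":  dict(heads=16, embed_dim=1024, ffn_dim=4096, layers=6),
}

def build_optimizer(model, lr=5e-4):
    # Adam with beta1 = 0.9, beta2 = 0.98, eps = 1e-8 as stated in the paper;
    # the learning rate itself is an assumed placeholder.
    return torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.98), eps=1e-8)

def inverse_sqrt_lr(step, warmup_steps=4000, peak_lr=5e-4):
    # Common inverse-square-root schedule with linear warmup (assumed form).
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (warmup_steps ** 0.5) * (step ** -0.5)
```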