Mixture Models for Diverse Machine Translation: Tricks of the Trade
Authors: Tianxiao Shen, Myle Ott, Michael Auli, Marc’Aurelio Ranzato
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis shows that certain types of mixture models are more robust and offer the best trade-off between translation quality and diversity compared to variational models and diverse decoding approaches. |
| Researcher Affiliation | Collaboration | ¹MIT CSAIL, ²Facebook AI Research. |
| Pseudocode | No | The paper describes algorithms (e.g., the EM procedure used to train the mixture models) but does not provide them in a structured pseudocode or algorithm block; a hedged hard-EM sketch follows the table. |
| Open Source Code | Yes | Code to reproduce the results in this paper is available at https://github.com/pytorch/fairseq |
| Open Datasets | Yes | We test mixture models and baselines on three benchmark datasets that uniquely provide multiple human references (Ott et al., 2018a; Hassan et al., 2018). WMT 17 English-German (En-De): We train on all available bitext and filter sentence pairs that have source or target longer than 80 words, resulting in 4.5M sentence pairs. WMT 14 English-French (En-Fr): We borrow the setup of Gehring et al. (2017) with 36M training sentence pairs and 40K joint BPE vocabulary. WMT 17 Chinese-English (Zh-En): We pre-process the training data following Hassan et al. (2018), which results in 20M sentence pairs. |
| Dataset Splits | Yes | WMT 17 English-German (En-De): We develop on newstest2013 and test on a 500 sentence subset of newstest2014 that has 10 reference translations (Ott et al., 2018a). WMT 14 English-French (En-Fr): We validate on newstest2012+2013, and test on a 500 sentence subset of newstest2014 with 10 reference translations (Ott et al., 2018a). WMT 17 Chinese-English (Zh-En): We develop on devtest2017 and report results on newstest2017 with 3 reference translations. |
| Hardware Specification | Yes | We run experiments on between 8 and 128 Nvidia V100 GPUs with mini-batches of approximately 25K and 400K tokens for the experiments of §5.1 and §5.2, respectively, following Ott et al. (2018b). |
| Software Dependencies | No | The paper mentions software like Fairseq (Ott et al., 2019), Moses tokenizer (Koehn et al., 2007), and Adam algorithm (Kingma & Ba, 2015). While Fairseq is cited, implying a version associated with that publication, specific version numbers (e.g., vX.Y) for Fairseq or other libraries like PyTorch or CUDA are not explicitly stated within the text. |
| Experiment Setup | Yes | The encoder and decoder have 6 blocks. The number of attention heads, embedding dimension and inner-layer dimension are 8, 512, 2048 for the base configuration and 16, 1024, 4096 for the big configuration, respectively. Models are optimized with the Adam algorithm (Kingma & Ba, 2015) using β1 = 0.9, β2 = 0.98, and ϵ = 1e-8. We use the same learning rate schedule as Ott et al. (2018b). We run experiments on between 8 and 128 Nvidia V100 GPUs with mini-batches of approximately 25K and 400K tokens for the experiments of §5.1 and §5.2, respectively. (A hedged optimizer sketch follows the table.) |
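The paper trains its mixture-of-experts translation models with an EM-style procedure but, as noted in the Pseudocode row, provides no algorithm block. The following is a minimal, hypothetical PyTorch sketch of one hard-EM update (each sentence pair is assigned to its best-scoring mixture component, then a gradient step is taken on that component). The `model(src, tgt, z)` interface returning per-sentence losses is an assumption for illustration; this is not the authors' fairseq implementation.

```python
import torch

def hard_em_step(model, optimizer, src, tgt, num_components):
    """One hard-EM update for a K-component mixture of translation models.

    `model(src, tgt, z)` is assumed (hypothetically) to return a per-sentence
    loss tensor of shape (batch,) when the decoder is conditioned on the
    mixture component indices `z`.
    """
    batch_size = src.size(0)

    # E-step: score every component without tracking gradients and assign
    # each sentence pair to the component with the lowest loss.
    with torch.no_grad():
        per_component = torch.stack(
            [model(src, tgt, z=torch.full((batch_size,), k, dtype=torch.long))
             for k in range(num_components)],
            dim=0)                            # shape (K, batch)
        best_z = per_component.argmin(dim=0)  # shape (batch,)

    # M-step: a single gradient step on the loss of the selected components.
    optimizer.zero_grad()
    loss = model(src, tgt, z=best_z).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The released fairseq code linked in the Open Source Code row remains the authoritative implementation.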
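For the Experiment Setup row, the fragment below sketches the stated optimizer settings (Adam with β1 = 0.9, β2 = 0.98, ϵ = 1e-8) together with an inverse square-root learning-rate schedule in the style of Vaswani et al. (2017); the warmup length and the use of `torch.nn.Transformer` as a stand-in for the paper's base model are assumptions, not details taken from the paper.

```python
import torch

# Stand-in for the "base" configuration described in the paper:
# 6 encoder/decoder blocks, 8 heads, 512-dim embeddings, 2048-dim inner layer.
model = torch.nn.Transformer(d_model=512, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6,
                             dim_feedforward=2048)

# Adam hyper-parameters as stated in the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-8)

warmup_steps = 4000  # assumed value; the paper does not state the warmup length

def inverse_sqrt(step: int) -> float:
    # lr(step) = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
    step = max(step, 1)
    return 512 ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# With base lr = 1.0, the effective learning rate equals inverse_sqrt(step);
# call scheduler.step() once per optimizer update.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)
```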