Analyzing Uncertainty in Neural Machine Translation
Authors: Myle Ott, Michael Auli, David Grangier, Marc’Aurelio Ranzato
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments rely on the pre-trained models of the fairseq-py toolkit (Gehring et al., 2017), which achieve competitive performance on the datasets we consider. ... We evaluate with tokenized BLEU (Papineni et al., 2002) on the corpus-level and the sentence-level, after removing BPE splitting. Sentence-level BLEU is computed similarly to corpus BLEU, but with smoothed n-gram counts (+1) for n > 1 (Lin & Och, 2004). |
| Researcher Affiliation | Industry | 1Facebook AI Research, USA. |
| Pseudocode | No | The paper describes the sequence-to-sequence model and its training process in textual paragraphs, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We also release the data we collected for our evaluation, which consists of ten human translations for 500 sentences taken from the WMT 14 En-Fr and En-De test sets. Additional reference translations are available from: https://github.com/facebookresearch/analyzing-uncertainty-nmt.' This provides access to data, not the source code for the described methodology. |
| Open Datasets | Yes | We consider the following datasets: WMT 14 English-German (En-De): We use the same setup as Luong et al. (2015) which comprises 4.5M sentence pairs for training and we test on newstest2014. ... WMT 17 English-German (En-De): ... WMT 14 English-French (En-Fr): We remove sentences longer than 175 words and pairs with a source/target length ratio exceeding 1.5 resulting in 35.5M sentence pairs for training. |
| Dataset Splits | Yes | WMT 14 English-German (En-De): We use the same setup as Luong et al. (2015) which comprises 4.5M sentence pairs for training and we test on newstest2014. We build a validation set by removing 44k random sentence-pairs from the training data. ... WMT 14 English-French (En-Fr): Results are reported on both newstest2014 and a validation set held-out from the training data comprising 26k sentence pairs. |
| Hardware Specification | No | The paper states 'Our experiments rely on the pre-trained models of the fairseq-py toolkit,' but it does not provide specific details about the hardware used (e.g., GPU models, CPU types, memory) for running experiments. |
| Software Dependencies | No | The paper mentions using the 'fairseq-py toolkit (Gehring et al., 2017)' but does not specify its version number or the versions of other software dependencies required to reproduce the experiments. |
| Experiment Setup | Yes | Formally, let x be an input sentence with m words $\{x_1, \dots, x_m\}$, and t be the ground truth target sentence with n words $\{t_1, \dots, t_n\}$. The model is composed of an encoder and a decoder. ... To train the translation model, we minimize the cross-entropy loss: $\mathcal{L} = -\sum_{i=1}^{n} \log p(t_i \mid t_{i-1}, \dots, t_1, x)$, using Nesterov's momentum (Sutskever et al., 2013). ... At test time, we aim to output the most likely translation given the source sentence, according to the model estimate. We approximate such an output via beam search. Unless otherwise stated, we use beam width k = 5, where hypotheses are selected based on their length-normalized log-likelihood. |
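The evaluation row above mentions sentence-level BLEU with add-1 smoothing on n-gram counts for n > 1 (Lin & Och, 2004). As a minimal sketch of that metric (not the paper's actual evaluation code, and function names are our own), the smoothed precision and brevity penalty can be computed as:

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU with +1 smoothing of n-gram counts for n > 1,
    following the description in the paper (Lin & Och, 2004)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        ref_ngrams = ngrams(ref, n)
        # clipped matches against the reference counts
        match = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(len(hyp) - n + 1, 0)
        if n > 1:  # +1 smoothing for higher-order n-grams only
            match += 1
            total += 1
        if total == 0 or match == 0:
            return 0.0
        log_precisions.append(math.log(match / total))
    # brevity penalty: penalize hypotheses shorter than the reference
    bp = min(0.0, 1.0 - len(ref) / len(hyp))
    return math.exp(bp + sum(log_precisions) / max_n)
```

Without the +1 smoothing, any sentence lacking a single 4-gram match would receive a BLEU of zero, which makes corpus-style BLEU uninformative at the sentence level.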
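The setup row also states that decoding uses beam search with width k = 5, ranking hypotheses by length-normalized log-likelihood. A toy sketch of that procedure follows; it is not the fairseq-py implementation, and `step_fn` is an assumed callback that maps a prefix to a `{token: log_prob}` dictionary:

```python
import math


def beam_search(step_fn, bos, eos, beam_size=5, max_len=20):
    """Minimal beam search sketch. `step_fn(prefix)` is an assumed
    callback returning next-token log-probabilities for a prefix.
    Finished hypotheses are ranked by length-normalized log-likelihood,
    as described in the paper's setup."""
    beams = [([bos], 0.0)]          # (token list, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_fn(tuple(prefix)).items():
                candidates.append((prefix + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size * 2]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
            if len(beams) == beam_size:
                break
        if not beams:
            break
    if not finished:                # no hypothesis reached eos in time
        finished = beams
    # select by log-likelihood divided by hypothesis length
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]
```

Length normalization matters here: the raw log-likelihood of a hypothesis only decreases as it grows, so without dividing by length the search systematically favors shorter outputs.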