Synthetic and Natural Noise Both Break Neural Machine Translation

Authors: Yonatan Belinkov, Yonatan Bisk

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1 shows how the performance of two state-of-the-art NMT systems degrades when translating German to English as a function of the percent of German words modified. Here we show three types of noise: 1) Random permutation of the word, 2) Swapping a pair of adjacent letters, and 3) Natural human errors. We discuss these types of noise and others in depth in section 4.2. The important thing to note is that even small amounts of noise lead to substantial drops in performance. (A noise-generation sketch follows the table.)
Researcher Affiliation | Academia | Yonatan Belinkov, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (belinkov@mit.edu); Yonatan Bisk, Paul G. Allen School of Computer Science & Engineering, University of Washington (ybisk@cs.washington.edu)
Pseudocode | No | The paper describes its methods and algorithms in text and presents data in tables and figures, but it does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | In order to facilitate future work on noise in NMT, we release code and data for generating the noise used in our experiments. https://github.com/ybisk/charNMT-noise
Open Datasets | Yes | We use the TED talks parallel corpus prepared for IWSLT 2016 (Cettolo et al., 2012) for testing all of the NMT systems, as well as for training the charCNN models.
Dataset Splits | Yes | We follow the official training/development/test splits. All texts are tokenized with the Moses tokenizer. Table 1 summarizes statistics on the TED talks corpus. (A tokenization example follows the table.)
Hardware Specification | No | The paper does not provide any specific details regarding the hardware used for running experiments, such as GPU models, CPU types, or cloud computing instances.
Software Dependencies | No | The paper mentions using 'the implementation in Kim (2016)' and 'the Moses tokenizer' but does not provide specific version numbers for these or any other software dependencies required for replication.
Experiment Setup | Yes | The charCNN model has two long short-term memory (Hochreiter & Schmidhuber, 1997) layers in the encoder and decoder. A CNN over characters in each word replaces the word embeddings on the encoder side (for simplicity, the decoder is word-based). We use 1000 filters with a width of 6 characters. The character embedding size is set to 25. The convolutions are followed by Tanh and max-pooling over the length of the word (Kim et al., 2015). We train charCNN with the implementation in Kim (2016); all other settings are kept to default values. (An encoder sketch follows the table.)
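
A minimal sketch of the two synthetic noise types quoted under Research Type above (swapping a pair of adjacent letters, and randomly permuting all letters of a word). This is not the released charNMT-noise code: the function names, the per-word noise probability, and the handling of short words are illustrative assumptions. Natural noise, by contrast, is harvested from corpora of real human errors and cannot be generated this way.

import random


def swap_adjacent(word: str) -> str:
    """Swap one randomly chosen pair of adjacent letters."""
    if len(word) < 4:
        return word  # skip very short words, which have few distinct swaps (illustrative choice)
    i = random.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def fully_random(word: str) -> str:
    """Randomly permute all letters of the word."""
    chars = list(word)
    random.shuffle(chars)
    return "".join(chars)


def add_noise(sentence: str, noise_fn, p: float = 0.5) -> str:
    """Apply noise_fn to each whitespace-separated token independently with probability p."""
    return " ".join(noise_fn(tok) if random.random() < p else tok for tok in sentence.split())


if __name__ == "__main__":
    random.seed(0)
    src = "nach dem Abendessen gingen wir spazieren"
    print(add_noise(src, swap_adjacent, p=1.0))
    print(add_noise(src, fully_random, p=1.0))

Varying p corresponds roughly to the "percent of German words modified" axis described in the Figure 1 caption quoted above.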
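The Dataset Splits row notes that all texts are tokenized with the Moses tokenizer; the paper does not say which implementation or version was used. The sketch below uses the sacremoses Python port purely as a stand-in for the original Moses Perl script.

# pip install sacremoses
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="de")  # German side of the de-en TED corpus
print(mt.tokenize("Das ist ein Test, oder?"))
# ['Das', 'ist', 'ein', 'Test', ',', 'oder', '?']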
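The Experiment Setup row specifies the charCNN configuration: character embeddings of size 25, 1000 convolutional filters of width 6, Tanh, and max-pooling over the length of the word, feeding a two-layer LSTM encoder-decoder. The PyTorch sketch below illustrates only the character-level word encoder under those stated hyperparameters; it is not the Kim (2016) Torch implementation the authors actually trained with, and the padding and vocabulary handling are assumptions.

import torch
import torch.nn as nn


class CharCNNWordEncoder(nn.Module):
    """Builds a word representation from its characters:
    char embeddings -> 1D convolution -> Tanh -> max-pool over the word length.
    Hyperparameters follow the quoted setup (embedding size 25, 1000 filters, width 6)."""

    def __init__(self, n_chars: int, char_emb: int = 25,
                 n_filters: int = 1000, width: int = 6, pad_idx: int = 0):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_emb, padding_idx=pad_idx)
        # Convolve over the character positions; input channels = char embedding size.
        self.conv = nn.Conv1d(char_emb, n_filters, kernel_size=width)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (num_words, max_word_len) character indices, padded to at least `width`
        x = self.embed(char_ids)        # (num_words, max_word_len, char_emb)
        x = x.transpose(1, 2)           # (num_words, char_emb, max_word_len) for Conv1d
        x = torch.tanh(self.conv(x))    # (num_words, n_filters, max_word_len - width + 1)
        x, _ = x.max(dim=2)             # max-pool over word length -> (num_words, n_filters)
        return x                        # used in place of a word embedding


if __name__ == "__main__":
    enc = CharCNNWordEncoder(n_chars=100)
    words = torch.randint(1, 100, (8, 12))  # 8 words of up to 12 characters (index 0 = padding)
    print(enc(words).shape)                 # torch.Size([8, 1000])

In the full model, this 1000-dimensional output replaces the word embedding fed to the encoder LSTM, while the decoder remains word-based, as quoted in the table.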