Simplify-Then-Translate: Automatic Preprocessing for Black-Box Translation

Authors: Sneha Mehta, Bahareh Azarnoush, Boris Chen, Avneesh Saluja, Vinith Misra, Ballav Bihani, Ritwik Kumar (pp. 8488-8495)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that this preprocessing leads to better translation performance as compared to non-preprocessed source sentences. We further perform side-by-side human evaluation to verify that translations of the simplified sentences are better than the original ones. Finally, we provide some guidance on recommended language pairs for generating the simplification model corpora by investigating the relationship between ease of translation of a language pair (as measured by BLEU) and quality of the resulting simplification model from backtranslations of this language pair (as measured by SARI), and tie this into the downstream task of low-resource translation. (An illustrative BLEU computation appears after the table.)
Researcher Affiliation | Collaboration | Sneha Mehta (1), Bahareh Azarnoush (2), Boris Chen (2), Avneesh Saluja (2), Vinith Misra (2), Ballav Bihani (2), Ritwik Kumar (2); (1) Department of Computer Science, Virginia Tech, VA; (2) Netflix Inc., CA. Contact: snehamehta@vt.edu, {bazarnoush, bchen, asaluja, vmisra, bbihani, ritwikk}@netflix.com
Pseudocode | No | The paper describes the APP procedure in a step-by-step manner but does not present it as formal pseudocode or a clearly labeled algorithm block. (An illustrative sketch of the procedure appears after the table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for its methodology, nor a link to a code repository.
Open Datasets | Yes | WikiLarge: The WikiLarge dataset (Zhang and Lapata, 2017) was compiled using sentence alignments from other Wikipedia-based datasets (Zhu, Bernhard, and Gurevych, 2010; Woodsend and Lapata, 2011; Kauchak, 2013). OpenSubtitles: The OpenSubtitles dataset (Lison and Tiedemann, 2016) is a collection of translated movie subtitles obtained from opensubtitles.org.
Dataset Splits | Yes | The train split contains 296K sentence pairs and the validation split contains 992 sentence pairs.
Hardware Specification | Yes | We run all experiments using machines with 4 Nvidia V100 GPUs.
Software Dependencies | No | The paper mentions using the `tensor2tensor` library and an implementation of the TER score provided by Snover et al. (2006), but it does not specify version numbers for these software dependencies. (A dependency-version sketch appears after the table.)
Experiment Setup | Yes | All experiments are based on the transformer base architecture with 6 blocks in the encoder and decoder. We use the same hyper-parameters for all experiments, i.e., word representations of size 512 and feed-forward layers with inner dimension 4096. Dropout is set to 0.2 and we use 8 attention heads. Models are optimized with Adam (Kingma and Ba, 2014) using β1 = 0.9, β2 = 0.98, and ε = 1e-9, with the same learning rate schedule as Vaswani et al. (2017). We use 50,000 warmup steps. All models use label smoothing of 0.1 with a uniform prior distribution over the vocabulary. We use a sub-word vocabulary of size 32K implemented using the word-piece algorithm (Sennrich, Haddow, and Birch, 2016a). (An illustrative configuration sketch appears after the table.)
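
The abstract quoted in the Research Type row relates ease of translation (measured by BLEU) to simplification quality (measured by SARI). As a hedged illustration only, corpus-level BLEU can be computed with the sacrebleu package as below; the sentences are toy examples and the authors' exact evaluation tooling is not specified in the paper.

```python
import sacrebleu

# Toy data: system translations and one set of references (illustrative only).
hypotheses = ["the cat sat on the mat ."]
references = [["the cat is sitting on the mat ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"corpus BLEU = {bleu.score:.2f}")
```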
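
Since the paper does not include a formal algorithm block (see the Pseudocode row), the following is a minimal sketch of the simplify-then-translate (APP) pipeline as described in the paper. The `simplify` and `translate_black_box` callables are hypothetical stand-ins for the trained simplification model and the black-box MT system; they are not taken from any released code.

```python
from typing import Callable, List

def simplify_then_translate(
    sources: List[str],
    simplify: Callable[[str], str],
    translate_black_box: Callable[[str], str],
) -> List[str]:
    """Automatic preprocessing (APP): simplify each source sentence with a
    learned simplification model, then hand the simplified sentence to the
    black-box translation system."""
    translations = []
    for sentence in sources:
        simplified = simplify(sentence)              # preprocessing step
        translations.append(translate_black_box(simplified))
    return translations
```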
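
Because the Software Dependencies row notes that no version numbers are given, a small reproducibility aid is sketched below: it records the installed versions of the libraries the paper mentions. The package names are assumptions about a typical `tensor2tensor` setup, not versions reported by the authors.

```python
from importlib.metadata import version, PackageNotFoundError

# Record versions of the (assumed) key dependencies for reproducibility.
for package in ("tensor2tensor", "tensorflow", "sacrebleu"):
    try:
        print(f"{package}=={version(package)}")
    except PackageNotFoundError:
        print(f"{package}: not installed")
```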
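
To make the Experiment Setup row concrete, the sketch below collects the reported hyper-parameters into a plain Python mapping and implements the Vaswani et al. (2017) learning-rate schedule with the paper's 50,000 warm-up steps. This is an illustrative reconstruction, not the authors' configuration file.

```python
TRANSFORMER_BASE = {
    "encoder_blocks": 6,
    "decoder_blocks": 6,
    "hidden_size": 512,        # word-representation size
    "filter_size": 4096,       # feed-forward inner dimension
    "attention_heads": 8,
    "dropout": 0.2,
    "label_smoothing": 0.1,
    "vocab_size": 32_000,      # sub-word (word-piece) vocabulary
    "adam_beta1": 0.9,
    "adam_beta2": 0.98,
    "adam_epsilon": 1e-9,
    "warmup_steps": 50_000,
}

def learning_rate(step: int, d_model: int = 512, warmup_steps: int = 50_000) -> float:
    """Schedule from Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```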