Neural Machine Translation with Byte-Level Subwords

Authors: Changhan Wang, Kyunghyun Cho, Jiatao Gu (pp. 9154-9160)

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that BBPE has comparable performance to BPE while its size is only 1/8 of that for BPE. We run experiments on three bilingual corpora as well as a many-to-English multilingual dataset.
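The byte-level BPE (BBPE) idea behind that result can be sketched as ordinary BPE merges run over UTF-8 bytes instead of characters, which caps the base vocabulary at 256 symbols. The following is a minimal illustrative sketch, not the authors' implementation (which builds on SentencePiece and fairseq); all function names here are invented for illustration.

```python
# Minimal sketch of byte-level BPE: learn merges over UTF-8 byte tokens.
# Illustrative only; the paper's actual tooling is SentencePiece/fairseq.
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs across all sequences; return the top pair."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_bbpe(corpus, num_merges):
    """Learn BPE merges over byte tokens; base vocab is at most 256 bytes."""
    seqs = [[bytes([b]) for b in s.encode("utf-8")] for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs
```

Because the base symbols are bytes, the same starting vocabulary covers any script without out-of-vocabulary characters, which is what allows the much smaller vocabulary the review quotes.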
Researcher Affiliation | Collaboration | Facebook AI Research; New York University; CIFAR Global Scholar
Pseudocode | No | The paper describes an algorithm for decoding with byte-level subwords using dynamic programming (Eq. 1), but it is presented as a mathematical formula and descriptive text rather than as structured pseudocode or an algorithm block.
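The dynamic program the review refers to repairs model output: a byte-level decoder can emit sequences that are not valid UTF-8, and the paper recovers a maximal-length character sequence from them. A hedged sketch of that recurrence (mirroring Eq. 1 in spirit; the original's details may differ, and the helper names are invented):

```python
# Hedged sketch of dynamic-programming repair of byte output: find the
# longest character sequence recoverable from a possibly invalid UTF-8
# byte sequence, skipping bytes that cannot start a valid character.

def is_valid_utf8(chunk):
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def dp_recover(byte_seq):
    n = len(byte_seq)
    # best[i] = (max chars recovered in byte_seq[:i], backpointer, char or None)
    best = [(-1, -1, None)] * (n + 1)
    best[0] = (0, -1, None)
    for i in range(n):
        if best[i][0] < 0:
            continue
        # Option 1: skip byte i (treat it as corrupted).
        if best[i][0] > best[i + 1][0]:
            best[i + 1] = (best[i][0], i, None)
        # Option 2: consume a valid UTF-8 character of 1-4 bytes.
        for w in range(1, 5):
            j = i + w
            if j > n:
                break
            chunk = bytes(byte_seq[i:j])
            if is_valid_utf8(chunk) and best[i][0] + 1 > best[j][0]:
                best[j] = (best[i][0] + 1, i, chunk.decode("utf-8"))
    # Trace back the recovered characters.
    chars, i = [], n
    while i > 0:
        _, prev, ch = best[i]
        if ch is not None:
            chars.append(ch)
        i = prev
    return "".join(reversed(chars))
```

On well-formed input this returns the text unchanged; a stray lead byte with no continuation (e.g. a lone `0xC3`) is dropped rather than crashing decoding.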
Open Source Code | No | The paper does not provide a link to its own open-source code or explicitly state that the code for the method is available.
Open Datasets | Yes | English-German (En-De): we replicate the same setting as (Vaswani et al. 2017), which uses WMT 2014 data (newstest13 for validation and newstest14 for testing). Japanese-English (Ja-En): we follow (Michel and Neubig 2018) and concatenate KFTT (Neubig 2011), TED (Cettolo, Girardi, and Federico 2012) and JESC (Pryzant et al. 2017) to construct training, validation and test sets. Sinhala-English (Si-En): we use the data from FLoRes (Guzmán et al. 2019). Many-to-English (X-En): we adopt the TED Talks corpus compiled by (Ye et al. 2018), which includes parallel data for 59 languages.
Dataset Splits | Yes | English-German (En-De): we replicate the same setting as (Vaswani et al. 2017), which uses WMT 2014 data (newstest13 for validation and newstest14 for testing). Japanese-English (Ja-En): we follow (Michel and Neubig 2018) and concatenate KFTT (Neubig 2011), TED (Cettolo, Girardi, and Federico 2012) and JESC (Pryzant et al. 2017) to construct training, validation and test sets. Many-to-English (X-En): we use English as target and the other 58 languages as source. We sample 22K examples from the 135K development set for validation.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using Fairseq, SentencePiece, and sacreBLEU but does not specify their version numbers.
Experiment Setup | Yes | All model configurations are listed in Table 2. We set attention and ReLU dropout to 0.1, except for Si-En, for which we use 0.2. We use 0.2 residual dropout for Tbase models in X-En. We use a kernel size of 5 and a padding of 2 on both sides for all convolutional layers. We set beam width to 4 for En-De and 5 for the others, and use the best checkpoint by validation loss to generate the predictions.
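Since the paper builds on fairseq, the reported decoding settings would correspond to a generation command along the following lines; this is a hypothetical sketch, and the data-bin and checkpoint paths are placeholders, not from the paper.

```shell
# Hypothetical fairseq decoding invocation illustrating the reported
# En-De beam width of 4; paths are placeholders.
fairseq-generate data-bin/wmt14_en_de \
    --path checkpoints/checkpoint_best.pt \
    --beam 4
```

For the other language pairs the review quotes, the same command would use `--beam 5`, pointed at the corresponding data directory and checkpoint.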