Neural Machine Translation with Byte-Level Subwords
Authors: Changhan Wang, Kyunghyun Cho, Jiatao Gu
AAAI 2020, pp. 9154-9160
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that BBPE has comparable performance to BPE while its size is only 1/8 of that for BPE. We run experiments on three bilingual corpora as well as a many-to-English multilingual dataset. |
| Researcher Affiliation | Collaboration | Facebook AI Research; New York University; CIFAR Global Scholar |
| Pseudocode | No | The paper describes an algorithm for decoding with byte-level subwords using dynamic programming (Eq. 1), but it is presented as a mathematical formula and descriptive text rather than structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not provide a link to its own open-source code or explicitly state that the code for their method is available. |
| Open Datasets | Yes | English-German (En-De): we replicate the same setting of (Vaswani et al. 2017) which uses WMT 2014 data (newstest13 for validation and newstest14 for testing); Japanese-English (Ja-En): we follow (Michel and Neubig 2018) and concatenate KFTT (Neubig 2011), TED (Cettolo, Girardi, and Federico 2012) and JESC (Pryzant et al. 2017) to construct training, validation and test sets; Sinhala-English (Si-En): we use the data from FLoRes (Guzmán et al. 2019); Many-to-English (X-En): we adopt the TED Talks corpus compiled by (Ye et al. 2018), which includes parallel data for 59 languages. |
| Dataset Splits | Yes | English-German (En-De): we replicate the same setting of (Vaswani et al. 2017) which uses WMT 2014 data (newstest13 for validation and newstest14 for testing); Japanese-English (Ja-En): we follow (Michel and Neubig 2018) and concatenate KFTT (Neubig 2011), TED (Cettolo, Girardi, and Federico 2012) and JESC (Pryzant et al. 2017) to construct training, validation and test sets; For our experiments, we use English as target and the other 58 languages as source. We sample 22K examples from the 135K development set for validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using Fairseq, SentencePiece, and sacreBLEU but does not specify their version numbers. |
| Experiment Setup | Yes | All model configurations are listed in table 2. We set attention and ReLU dropout to 0.1, except Si-En for which we use 0.2. We use 0.2 residual dropout for Tbase models in X-En. We use a kernel size of 5 and a padding of 2 on both sides for all convolutional layers. We set beam width to 4 for En-De and 5 for the others and use the best checkpoint by validation loss to generate the predictions. |
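To make the core idea behind the reviewed method concrete: BBPE applies BPE over UTF-8 bytes rather than characters, so the base vocabulary is at most 256 symbols regardless of script, and model output may contain byte sequences that do not decode to valid text (which the paper handles with the dynamic-programming recovery of Eq. 1). The sketch below is purely illustrative and is not the authors' code; the function names `to_byte_tokens` and `from_byte_tokens` are our own, and the `errors="replace"` fallback stands in for the paper's recovery procedure.

```python
# Illustrative sketch of the byte-level representation underlying BBPE
# (not the authors' implementation).

def to_byte_tokens(text: str) -> list:
    """Map text to its UTF-8 byte sequence: the base token inventory
    that byte-level BPE merges operate on (at most 256 symbols)."""
    return list(text.encode("utf-8"))

def from_byte_tokens(tokens: list) -> str:
    """Decode byte tokens back to text. Model output can contain
    invalid UTF-8; the paper recovers characters via dynamic
    programming (Eq. 1), which we approximate here by replacing
    undecodable bytes."""
    return bytes(tokens).decode("utf-8", errors="replace")

tokens = to_byte_tokens("日本語")  # 3 characters expand to 9 UTF-8 bytes
assert len(tokens) == 9
assert from_byte_tokens(tokens) == "日本語"
```

This illustrates why byte-level segments tend to be longer than character-level ones for non-Latin scripts, and why decoding needs a validity-aware recovery step at all.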