AdMix: A Mixed Sample Data Augmentation Method for Neural Machine Translation
Authors: Chang Jin, Shigui Qiu, Nini Xiao, Hao Jia
IJCAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on three translation datasets of different scales show that AdMix achieves significant improvements (1.0 to 2.7 BLEU points) over a strong Transformer baseline. When combined with other data augmentation techniques (e.g., back-translation), our approach can obtain further improvements. |
| Researcher Affiliation | Academia | Chang Jin, Shigui Qiu, Nini Xiao, Hao Jia. Institute of Artificial Intelligence, School of Computer Science and Technology, Soochow University. {cjin, sgqiu, nnxiaoxiao, hjia}@stu.suda.edu.cn |
| Pseudocode | Yes | Algorithm 1: AdMix Pseudocode |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | Datasets. For IWSLT14 German-English, following Edunov et al. [2018], we apply the byte-pair encoding (BPE) [Sennrich et al., 2016b] script to preprocess the training corpus with 10K joint operations, which consists of 0.16M sentence pairs. For the Chinese-English translation task, the training set is the LDC corpus which contains 1.25M sentence pairs. For English-German translation, we use the WMT14 corpus consisting of 4.5M sentence pairs. |
| Dataset Splits | Yes | For IWSLT14 German-English, following Edunov et al. [2018], we apply the byte-pair encoding (BPE) [Sennrich et al., 2016b] script to preprocess the training corpus with 10K joint operations, which consists of 0.16M sentence pairs. The validation set is split from the training set and the test set is the concatenation of tst2010, tst2011, tst2012, dev2010, and dev2012. For the Chinese-English translation task, the training set is the LDC corpus which contains 1.25M sentence pairs. The validation set is the NIST 06 dataset, and test sets are NIST 02, 03, 04, 05, 08. For English-German translation, we use the WMT14 corpus consisting of 4.5M sentence pairs. The validation set is newstest2013 and the test set is newstest2014. |
| Hardware Specification | Yes | We train on two V100 GPUs and accumulate the gradients 2 times before updating. |
| Software Dependencies | No | The paper mentions using Adam for optimization but does not provide specific version numbers for any software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | For the IWSLT14 German-English task, the dimensions of the embedding, the feed-forward network, and the number of layers of the Transformer models are 512, 1024, and 6 respectively. The dropout rate is 0.3, and the batch size is 8192 tokens. For the LDC Chinese-English task and the WMT14 English-German task, the dimensions of the embedding, the feed-forward network, and the number of layers of the Transformer models are 512, 2048, and 6 respectively. The dropout rates are 0.3 and 0.1 for the Zh-En and En-De tasks respectively, and the batch size is 8192 tokens for both. For all models except the En-De task model, we use Adam with learning rate 5 × 10⁻⁴ and the inverse-sqrt learning rate scheduler to optimize the models. For the En-De task model, we use Adam with learning rate 7 × 10⁻⁴ and the inverse-sqrt learning rate scheduler. There are two important hyperparameters in our approach: λ in the objective loss function and the discrete noise fraction γ. For all datasets, we set the noise fraction γ to 0.1 and the hyperparameter λ to 10 by default. (A hedged configuration sketch follows the table.) |
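The paper does not release code, so the exact training script is unknown. The following is a minimal PyTorch sketch of the reported IWSLT14 De-En configuration only (embedding 512, feed-forward 1024, 6 layers, dropout 0.3, Adam with learning rate 5 × 10⁻⁴ and an inverse-sqrt schedule, gradient accumulation of 2). The number of attention heads, the Adam betas, and the 4000-step warmup are assumptions not stated in the excerpts above, and this is not the authors' AdMix training code.

```python
import torch
import torch.nn as nn

# Transformer dimensions reported for IWSLT14 De-En: d_model 512, FFN 1024, 6 layers, dropout 0.3.
# The number of attention heads (4) is an assumption; the paper excerpt does not state it.
model = nn.Transformer(
    d_model=512,
    nhead=4,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=1024,
    dropout=0.3,
    batch_first=True,
)

# Adam with lr 5e-4 and an inverse-sqrt schedule, as in the experiment-setup excerpt.
# The betas and the warmup length are assumptions (not given in the paper excerpt).
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

warmup_steps = 4000

def inverse_sqrt(step: int) -> float:
    """Scale factor relative to the peak lr: linear warmup, then decay proportional to step**-0.5."""
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

# Per-update usage (gradients accumulated twice before each update, matching the reported
# 8192-token batch size on two V100 GPUs):
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

The sketch only reconstructs the optimizer and model-size settings quoted in the table; the AdMix augmentation itself (the λ-weighted objective and the noise fraction γ) follows the paper's Algorithm 1, which is not reproduced here.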