Parameter Differentiation Based Multilingual Neural Machine Translation
Authors: Qian Wang, Jiajun Zhang (pp. 11440-11448)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on multilingual datasets have demonstrated that our method significantly outperforms various strong baselines with different parameter sharing configurations. |
| Researcher Affiliation | Academia | Qian Wang (1,2), Jiajun Zhang (1,2)*: 1) National Laboratory of Pattern Recognition, Institute of Automation, CAS; 2) School of Artificial Intelligence, University of Chinese Academy of Sciences. Email: {qian.wang, jjzhang}@nlpr.ia.ac.cn |
| Pseudocode | Yes | Algorithm 1: Parameter Differentiation |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of the methodology's code. |
| Open Datasets | Yes | We use the public OPUS and WMT multilingual datasets to evaluate our method on many-to-one (M2O) and one-to-many (O2M) translation scenarios, and the IWSLT datasets for the many-to-many (M2M) translation scenario. The OPUS dataset consists of English to 12 languages selected from the original OPUS-100 dataset (Zhang et al. 2020). The WMT dataset with unbalanced data distribution is collected from the WMT 14, WMT 16 and WMT 18 benchmarks. |
| Dataset Splits | Yes | The held-out multi-way aligned validation data for measuring gradient similarities contains 4,000 sentences for each language, which are randomly selected and excluded from the training set. |
| Hardware Specification | Yes | All the models are trained and tested on a single Nvidia V100 GPU. |
| Software Dependencies | No | The paper mentions using the Transformer architecture, the Adam optimizer, byte-pair encoding (BPE), and SacreBLEU, but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We conduct our experiments with the Transformer architecture and adopt the transformer-base setting, which includes 6 encoder and decoder layers, 512/2048 hidden dimensions and 8 attention heads. Dropout (p = 0.1) and label smoothing (ϵ_ls = 0.1) are applied during training but disabled during validation and inference. Each mini-batch contains roughly 8,192 tokens. We accumulate gradients and update the model every 4 steps for OPUS and 8 steps for WMT to simulate multi-GPU training. In inference, we use beam search with a beam size of 4 and a length penalty of 0.6. The total training step Q is set to 400k for all experiments, and the differentiation happens every N = 8000 steps of training. We set the expected model size O to 2·O0, i.e., twice that of the original model. |
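The Experiment Setup row above fully specifies the reported training and decoding hyperparameters, so a minimal sketch of that configuration follows. It is a plain-Python restatement of the numbers quoted in the table, not the authors' implementation; every identifier (TRANSFORMER_BASE, TRAINING, INFERENCE, differentiation_steps) is hypothetical.

```python
# Hypothetical restatement of the reported setup as plain Python constants.
# Identifier names are illustrative only; they do not come from the paper's code.

TRANSFORMER_BASE = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "hidden_dim": 512,
    "ffn_dim": 2048,
    "attention_heads": 8,
    "dropout": 0.1,          # applied in training, disabled at validation/inference
    "label_smoothing": 0.1,  # applied in training, disabled at validation/inference
}

TRAINING = {
    "tokens_per_batch": 8192,
    "gradient_accumulation_steps": {"OPUS": 4, "WMT": 8},  # simulates multi-GPU training
    "total_steps": 400_000,             # Q in the paper
    "differentiation_interval": 8_000,  # N: differentiation every N training steps
    "expected_size_ratio": 2.0,         # expected model size = 2x the original model
}

INFERENCE = {
    "beam_size": 4,
    "length_penalty": 0.6,
}


def differentiation_steps(total_steps: int, interval: int) -> list[int]:
    """Training steps at which a differentiation round would be triggered."""
    return list(range(interval, total_steps + 1, interval))


if __name__ == "__main__":
    steps = differentiation_steps(
        TRAINING["total_steps"], TRAINING["differentiation_interval"]
    )
    print(f"{len(steps)} differentiation rounds; first at step {steps[0]}, last at step {steps[-1]}")
```

Under the reported schedule this yields 50 differentiation rounds (400k / 8k), the first at step 8000 and the last at step 400000.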