Cross-Lingual Natural Language Generation via Pre-Training

Authors: Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, Heyan Huang (pp. 7570-7577)

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on question generation and abstractive summarization show that our model outperforms the machine-translation-based pipeline methods for zero-shot cross-lingual generation. Moreover, cross-lingual transfer improves NLG performance of low-resource languages by leveraging rich-resource language data.
Researcher Affiliation | Collaboration | Beijing Institute of Technology; Microsoft Research. {czw, maoxl, hhy63}@bit.edu.cn; {lidong1, fuwei, Wenhui.Wang}@microsoft.com
Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our implementation and data are available at https://github.com/CZWin32768/xnlg.
Open Datasets | Yes | We use SQuAD 1.1 (Rajpurkar et al. 2016) as the English QG dataset. For Chinese QG, we follow the default data splits of WebQA (Li et al. 2016). We use Wikipedia as the monolingual data for the DAE objective, and MultiUN (Ziemski, Junczys-Dowmunt, and Pouliquen 2016) as the parallel data for the XAE objective.
Dataset Splits | Yes | We use SQuAD 1.1 (Rajpurkar et al. 2016) as the English QG dataset. It is a popular English question answering dataset containing over 100,000 questions and their corresponding annotated passages. Following (Zhao et al. 2018), we regard the original development set as the test set, and sample 5,000 examples from the training data of the two datasets as the development sets. For Chinese QG, we follow the default data splits of WebQA (Li et al. 2016). ... For each language, we sample 500k/5k/5k examples for training/validation/test. (A split-sampling sketch follows this table.)
Hardware Specification | Yes | It takes about 30 hours to run 23,000 steps for the pre-training procedure by using 4 Nvidia Tesla V100-16GB GPUs.
Software Dependencies | No | The paper mentions using "the tokenizer provided by (Chang, Galley, and Manning 2008) for Chinese, and Moses for other languages, respectively" and a "subword vocabulary learned by BPE (Sennrich, Haddow, and Birch 2015)". While these are software-related, no version numbers are given for these tools or for common libraries such as PyTorch or TensorFlow, which would be necessary for full reproducibility. (A hedged preprocessing sketch follows this table.)
Experiment Setup | Yes | We use Adam optimizer with a linear warm-up over the first 4,000 steps and linear decay for later steps, and the learning rate is set to 10^-4. The pre-training batch size is 64, and the sequence length is set to 256. ... For fine-tuning on downstream NLG tasks, we use Adam optimizer with a learning rate of 5 × 10^-6. We set the batch size as 16 and 32 for question generation and abstractive summarization, respectively. When the target language is the same as the language of training data, we fine-tune all parameters. When the target language is different from the language of training data, we fine-tune the Transformer layers of the encoder. ... During decoding, we use beam search with a beam size of 3, and limit the length of the target sequence to 80 tokens. (An optimizer and decoding sketch follows this table.)
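
The Dataset Splits row reports 500k/5k/5k per-language splits and a 5,000-example development set sampled from training data. A minimal sketch of how one might reproduce that sampling is shown below; the paper does not specify a random seed or the exact sampling procedure, so those choices are assumptions.

```python
import random

def sample_splits(examples, n_train, n_dev, n_test, seed=42):
    """Randomly partition `examples` into train/dev/test splits.

    The 500k/5k/5k sizes follow the paper's description; the seed and
    shuffling strategy are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    examples = list(examples)
    rng.shuffle(examples)
    assert len(examples) >= n_train + n_dev + n_test
    train = examples[:n_train]
    dev = examples[n_train:n_train + n_dev]
    test = examples[n_train + n_dev:n_train + n_dev + n_test]
    return train, dev, test

# e.g. per-language splits as reported in the paper:
# train, dev, test = sample_splits(all_examples, 500_000, 5_000, 5_000)
```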
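Because no tool versions are given, reproducing the preprocessing requires assumptions. The sketch below uses sacremoses (a Python port of the Moses tokenizer) and subword-nmt (the reference BPE implementation from Sennrich, Haddow, and Birch 2015); the paper does not confirm these exact packages, the BPE codes path is hypothetical, and the Stanford Chinese tokenizer step is omitted.

```python
# Hypothetical preprocessing pipeline; the paper names Moses and BPE but
# not the concrete packages or versions, so sacremoses and subword-nmt
# are assumptions.
from sacremoses import MosesTokenizer          # pip install sacremoses
from subword_nmt.apply_bpe import BPE          # pip install subword-nmt

def preprocess(line, lang="en", bpe_codes_path="codes.en"):
    """Tokenize with Moses rules, then segment into BPE subwords."""
    tokenizer = MosesTokenizer(lang=lang)
    tokenized = tokenizer.tokenize(line, return_str=True)
    with open(bpe_codes_path, encoding="utf-8") as codes_file:
        bpe = BPE(codes_file)
    return bpe.process_line(tokenized)

# print(preprocess("Cross-lingual generation transfers knowledge across languages."))
```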
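The Experiment Setup row describes Adam with a linear warm-up over 4,000 steps followed by linear decay, a pre-training learning rate of 10^-4, and beam-search decoding with beam size 3 and a maximum target length of 80 tokens. Below is a minimal PyTorch sketch of that schedule; the choice of framework, the total step count (taken from the reported 23,000 pre-training steps), and the default Adam betas are assumptions rather than details confirmed by the paper.

```python
import torch

def build_optimizer_and_scheduler(model, lr=1e-4, warmup_steps=4_000,
                                  total_steps=23_000):
    """Adam with linear warm-up then linear decay, as described above.

    `total_steps` follows the reported 23,000 pre-training steps; other
    hyperparameters (e.g. betas, eps) are PyTorch defaults and are
    assumptions.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warm-up
        # linear decay from 1.0 to 0.0 over the remaining steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Fine-tuning reportedly reuses Adam with lr = 5e-6; decoding uses beam
# search with beam size 3 and target sequences capped at 80 tokens.
```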