Multilingual Pre-training with Universal Dependency Learning
Authors: Kailai Sun, Zuchao Li, Hai Zhao
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify the cross-lingual modeling capabilities of our model, we carry out experiments on both cross-lingual NLU benchmarks: XNLI and XQuAD, and linguistic structure parsing datasets: UD v2.7, SPMRL 14 [19], English Penn Treebank (PTB) 3.0 [20] and the Chinese Penn Treebank (CTB) 5.1 [21]. Our empirical results show that universal structure knowledge learnt and integrated can indeed help the multilingual PrLM obtain better universal linguistic word representations and outperform m-BERT and XLM-R baselines in all the above tasks. |
| Researcher Affiliation | Academia | Kailai Sun, Zuchao Li, Hai Zhao. 1 Department of Computer Science and Engineering, Shanghai Jiao Tong University; 2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China; 3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University. {kaishu2.0,charlee}@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn |
| Pseudocode | Yes | Algorithm 1: Training Process |
| Open Source Code | No | The paper does not provide any concrete access information or links to open-source code for the methodology described. |
| Open Datasets | Yes | For structure learning, we concatenate all the training Tree Banks covering 60 languages in Universal Dependencies Treebanks (v2.2) [30] as the training set. ... XNLI: Cross-lingual Natural Language Inference... Only English has training data, which is a crowd-sourced collection of 433k sentence pairs from MultiNLI [35]. ... XQuAD: Cross-lingual Question Answering Dataset [36] is a benchmark dataset... consists of a subset of 240 paragraphs and 1,190 question-answer pairs from the development set of SQuAD v1.1 [6]... |
| Dataset Splits | Yes | XNLI: Each language contains a development set with 2,490 sentence pairs and a test set with 5,010 sentence pairs. ... Models are trained for m = 600,000 and n = 600,000 epochs in each phase respectively, with Batch size = 128 (sents)... |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like WordPiece [31] tokenization, SentencePiece [32] tokenization, the Adam optimizer [33], and GELU activation [34], but it does not specify version numbers for these software dependencies (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | Our UD-BERT and UD-XLM-R_base use a Transformer architecture with L = 11, H = 768 and A = 12 with a vocabulary of 110k and 250k respectively. Our UD-XLM-R_large uses a large Transformer architecture with L = 23, H = 1024 and A = 16 with a 250k vocabulary. We train our models with the Adam optimizer [33] using the parameters: Learning rate = 5e-5, β1 = 0.9, β2 = 0.98, ϵ = 1e-6 and L2 weight decay of 0.01, a linear warmup [28], GELU activation [34] and a dropout rate of 0.1. Models are trained for m = 600,000 and n = 600,000 epochs in each phase respectively, with Batch size = 128 (sents), and the probability of training USL in the second phase is p = 0.8. The max sequence length of MLM is 384 and the max sequence length for UD parsing is 256. In the USL layer, we set H_head = 128 and H_dep = 64. (A hedged configuration sketch of these hyperparameters is given below the table.) |
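
The quoted experiment setup corresponds to a fairly standard Transformer pre-training configuration, and the sketch below restates it in code for readability. Only the numeric hyperparameter values come from the paper excerpt; the use of PyTorch's AdamW, the transformers linear-warmup scheduler, the mBERT backbone as a stand-in, the warmup length, and reading the 600,000-per-phase figures as optimizer steps are all assumptions made for illustration, not details confirmed by the paper.

```python
# Minimal sketch of the training configuration quoted in "Experiment Setup".
# Hyperparameter values are taken from the paper excerpt; everything else
# (AdamW, the linear-warmup scheduler, the mBERT stand-in backbone, the
# warmup length, and interpreting "600,000 per phase" as optimizer steps)
# is an assumption for illustration only.
import random

from torch.optim import AdamW
from transformers import BertModel, get_linear_schedule_with_warmup

model = BertModel.from_pretrained("bert-base-multilingual-cased")  # stand-in for UD-BERT

optimizer = AdamW(
    model.parameters(),
    lr=5e-5,            # Learning rate = 5e-5
    betas=(0.9, 0.98),  # beta1 = 0.9, beta2 = 0.98
    eps=1e-6,           # epsilon = 1e-6
    weight_decay=0.01,  # L2 weight decay of 0.01
)

steps_per_phase = 600_000          # m = n = 600,000 (interpreted here as steps)
total_steps = 2 * steps_per_phase
warmup_steps = 10_000              # assumed; the excerpt only says "a linear warmup"
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)

batch_size = 128       # sentences per batch
max_len_mlm = 384      # max sequence length for MLM
max_len_ud = 256       # max sequence length for UD parsing
p_usl = 0.8            # probability of a USL step in the second phase

def sample_phase2_task() -> str:
    """Pick the objective for one second-phase step: USL with probability p, else MLM."""
    return "USL" if random.random() < p_usl else "MLM"
```

In an actual run, each second-phase iteration would call `sample_phase2_task()`, compute either the USL parsing loss or the MLM loss on a batch, and then call `optimizer.step()` followed by `scheduler.step()`.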