Zero-Shot Cross-Lingual Machine Reading Comprehension via Inter-sentence Dependency Graph

Authors: Liyan Xu, Xuchao Zhang, Bo Zong, Yanchi Liu, Wei Cheng, Jingchao Ni, Haifeng Chen, Liang Zhao, Jinho D. Choi. Pages 11538-11546.

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on three multilingual MRC datasets (XQuAD, MLQA, TyDiQA-GoldP) show that our encoder that is only trained on English is able to improve the zero-shot performance on all 14 test sets covering 8 languages, with up to 3.8 F1 / 5.2 EM improvement on average, and 5.2 F1 / 11.2 EM on certain languages.
Researcher Affiliation Collaboration 1Department of Computer Science, Emory University, Atlanta, GA, USA 2NEC Laboratories America, Princeton, NJ, USA 1{liyan.xu, liang.zhao, jinho.choi}@emory.edu 2{xuczhang, bozong, yanchi, weicheng, jni, haifeng}@nec-labs.com
Pseudocode No The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methods are described in narrative text and mathematical equations.
Open Source Code Yes Our code is available at https://github.com/lxucs/multilingual-mrc-isdg.
Open Datasets Yes We evaluate our models on three multilingual MRC benchmarks suggested by XTREME: XQuAD (Artetxe, Ruder, and Yogatama 2020), MLQA (Lewis et al. 2020), TyDiQA-GoldP (Clark et al. 2020). For XQuAD and MLQA, models are trained on English SQuAD v1.1 (Rajpurkar et al. 2016) and evaluated directly on the test sets of each dataset in multiple target languages.
Dataset Splits Yes For XQuAD and MLQA, models are trained on English SQuAD v1.1 (Rajpurkar et al. 2016) and evaluated directly on the test sets of each dataset in multiple target languages. For TyDiQA-GoldP, models are trained on its English training set and evaluated directly on its test sets.
Hardware Specification Yes All experiments are conducted on an Nvidia A100 GPU, with training time around 1-2 hours for the baseline and 2.5-4 hours for the ISDG encoder.
Software Dependencies No The paper states: "We implement our models in PyTorch and use Stanza (Qi et al. 2020) to provide the UD features." However, it does not provide specific version numbers for PyTorch or Stanza, which are required for reproducibility.
Experiment Setup Yes For mBERT and XLM-R Large, we follow similar hyperparameter settings as XTREME, with 384 max sequence length and 2 training epochs. For mT5 Large, we only use its encoder and discard the decoder, and employ a learning rate of 1e-4, which achieves the same baseline results as reported by Xue et al. (2021). For experiments with ISDG, we limit the max path length to 8 and truncate long soft paths from the end. A hidden size of 64 is adopted for the POS and relation embeddings. Following SG-Net (Zhang et al. 2020), we append one final self-attention layer stacked upon the ISDG encoder.
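The hyperparameters quoted in this row can be collected into a minimal configuration sketch; the dictionary and its key names below are illustrative only and are not taken from the authors' released code:

```python
# Hedged sketch: hyperparameters reported in the paper's experiment setup,
# gathered into a plain dict for reference. Key names are assumptions.
isdg_hparams = {
    "max_seq_length": 384,          # following XTREME settings (mBERT, XLM-R Large)
    "num_train_epochs": 2,          # following XTREME settings
    "mt5_learning_rate": 1e-4,      # mT5 Large, encoder only (decoder discarded)
    "max_path_length": 8,           # soft paths truncated from the end beyond this
    "pos_rel_embedding_size": 64,   # hidden size for POS and relation embeddings
}

for name, value in isdg_hparams.items():
    print(f"{name}: {value}")
```

Version numbers for PyTorch and Stanza remain unspecified, so a full reproduction would still require pinning those dependencies independently.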