On Learning Universal Representations Across Languages
Authors: Xiangpeng Wei, Rongxiang Weng, Yue Hu, Luxi Xing, Heng Yu, Weihua Luo
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct evaluations on two challenging cross-lingual tasks, XTREME and machine translation. Experimental results show that the HICTL outperforms the state-of-the-art XLM-R by an absolute gain of 4.2% accuracy on the XTREME benchmark as well as achieves substantial improvements on both the high-resource and low-resource English→X translation tasks over strong baselines. |
| Researcher Affiliation | Collaboration | 1 Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; 2 School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China ({weixiangpeng,huyue,xingluxi}@iie.ac.cn); 3 Machine Intelligence Technology Lab, Alibaba Group, Hangzhou, China ({wengrx,yuheng.yh,weihua.luowh}@alibaba-inc.com) |
| Pseudocode | No | No structured pseudocode or algorithm blocks (e.g., labeled 'Algorithm 1') were found in the paper. The method is described using mathematical notation and prose. |
| Open Source Code | No | The paper mentions an 'official submission to XTREME (https://sites.research.google/xtreme)' but does not explicitly state that the source code for their methodology is provided or linked. |
| Open Datasets | Yes | During pre-training, we follow Conneau et al. (2020) to build a Common-Crawl corpus using the CCNet (Wenzek et al., 2019) tool for monolingual texts. Table 7 (see appendix A) reports the language codes and data size in our work. For parallel data, we use the same (English-to-X) MT dataset as Conneau & Lample (2019), which is collected from MultiUN (Eisele & Chen, 2010) for French, Spanish, Arabic and Chinese, the IIT Bombay corpus (Kunchukuttan et al., 2018a) for Hindi, OpenSubtitles 2018 for Turkish, Vietnamese and Thai, the EUbookshop corpus for German, Greek and Bulgarian, Tanzil for both Urdu and Swahili, and GlobalVoices for Swahili. Table 8 (see appendix A) shows the statistics of the parallel data. |
| Dataset Splits | Yes | We concatenate newstest 2012 and newstest 2013 as the validation set and use newstest 2014 as the test set. ... We split 7k sentence pairs from the training dataset for validation and concatenate dev2010, dev2012, tst2010, tst2011, tst2012 as the test set. |
| Hardware Specification | Yes | We run the pre-training experiments on 8 V100 GPUs, batch size 1024. |
| Software Dependencies | No | The paper mentions tools like 'multi-bleu.perl' and 'sacreBLEU' and refers to using the 'SentencePiece model with XLM-R', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Hyperparameters for pre-training and fine-tuning are shown in Table 9 (see appendix B). We run the pre-training experiments on 8 V100 GPUs, batch size 1024. The number of negative samples m=512 for word-level contrastive learning. |
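The word-level contrastive objective with m = 512 negative samples noted in the Experiment Setup row can be sketched as a standard InfoNCE-style loss: the anchor's aligned word is scored against the m negatives, and a softmax cross-entropy pulls the positive above them. This is a minimal illustrative sketch, not the authors' implementation; the variable names, toy dimensions, and temperature value are assumptions.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over one positive and m negatives.

    anchor, positive: (d,) word representations (illustrative shapes).
    negatives: (m, d) matrix of negative samples (m = 512 in the paper's setup).
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Temperature-scaled cosine similarities; the positive sits at index 0.
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / temperature
    # Softmax cross-entropy with target index 0 (log-sum-exp for stability).
    logits -= logits.max()
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(0)
d, m = 16, 512                                   # toy embedding size, paper's m
anchor = rng.standard_normal(d)
positive = anchor + 0.05 * rng.standard_normal(d)  # near-duplicate as the positive
negatives = rng.standard_normal((m, d))
loss = info_nce_loss(anchor, positive, negatives)
```

Minimizing this loss drives the anchor's similarity to its aligned word above its similarity to all 512 negatives, which is the effect the word-level contrastive term is described as having.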