EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning
Authors: Ping Guo, Xiangpeng Wei, Yue Hu, Baosong Yang, Dayiheng Liu, Fei Huang, Jun Xie
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate EMMA-X, we conduct experiments on XRETE, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that EMMA-X achieves state-of-the-art performance. |
| Researcher Affiliation | Collaboration | Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; Machine Intelligence Technology Lab, Alibaba DAMO Academy, Hangzhou, China |
| Pseudocode | Yes | For a clearer presentation, an Algorithm of EMMA-X is shown in Algorithm 1. |
| Open Source Code | Yes | Codes and datasets of the XRETE benchmark: https://github.com/guopingiie/EMMA-X |
| Open Datasets | Yes | We collect parallel corpora from CCAligned [El-Kishky et al., 2020], CCMatrix [Schwenk et al., 2021], WMT [Akhbardeh et al., 2021], and Multi UN [Ziemski et al., 2016], involving 94 languages with 3.2 billion sentence pairs. In addition, we add CC-100 [Conneau et al., 2020] as the large-scale monolingual corpus with about 800 billion sentences that covers 94 languages. |
| Dataset Splits | Yes | Table 9: Overview of XRETE tasks. For tasks that have training and dev sets in other languages, we only report the number of sentences in English sets. We report the number of test examples per language. Example: XNLI Train 392,702 Dev 222-743 Test 738-750. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or cloud computing instances with specifications) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions tools like 'Sentence Piece Model' and optimizers like 'Adam' with citations, but does not specify software library versions (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication. |
| Experiment Setup | Yes | The GMM classifier is implemented as a mixture of Gaussians, each of which consists of a prior π ∈ R^1, a mean µ ∈ R^1024, and a standard deviation σ ∈ R^1024, all of which are trainable variables. We set the total number of semantic ranks to N = 4. We optimize the GMM classifier with Adam (β1 = 0.9, β2 = 0.999) [Kingma and Ba, 2015] using a batch size of 1024 and a learning rate of 3e-5. For the cross-lingual encoder, we apply the same training setting as MoCo [He et al., 2020], with a momentum queue of size K = 256 and a temperature of 0.04. We set the momentum coefficient to 0.999 and use the Adam optimizer with a cosine-decay learning rate schedule peaking at 5e-4. |
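The GMM classifier described in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released implementation: it assumes one diagonal Gaussian component per semantic rank (N = 4) over 1024-dimensional sentence-pair features, with the prior, mean, and standard deviation as the trainable variables; the function and variable names are hypothetical.

```python
import numpy as np

# Illustrative sketch of a GMM semantic-rank classifier (names are
# hypothetical, not from the paper's code). Each of the N = 4 semantic
# ranks is one Gaussian component with a trainable prior (pi), mean (mu),
# and diagonal standard deviation (sigma) over 1024-dim representations.
N_RANKS = 4   # total semantic ranks, per the paper
DIM = 1024    # encoder hidden size

rng = np.random.default_rng(0)
log_pi = np.zeros(N_RANKS)                    # unnormalized log-priors
mu = rng.normal(size=(N_RANKS, DIM)) * 0.02   # component means
log_sigma = np.zeros((N_RANKS, DIM))          # log std-devs (init sigma = 1)

def rank_posterior(h):
    """Posterior p(rank | h) for a feature vector h of shape (DIM,)."""
    sigma = np.exp(log_sigma)
    # Log-density of h under each diagonal-Gaussian component.
    log_density = -0.5 * np.sum(
        ((h - mu) / sigma) ** 2 + 2.0 * log_sigma + np.log(2.0 * np.pi),
        axis=-1,
    )
    # Softmax over components normalizes priors and densities jointly.
    logits = log_density + log_pi
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()
```

In training, these parameters would be updated with Adam (β1 = 0.9, β2 = 0.999, learning rate 3e-5, batch size 1024) against the EM-style objective of the paper; the sketch above only shows the forward posterior computation.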