EMMA-X: An EM-like Multilingual Pre-training Algorithm for Cross-lingual Representation Learning

Authors: Ping Guo, Xiangpeng Wei, Yue Hu, Baosong Yang, Dayiheng Liu, Fei Huang, Jun Xie

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate EMMA-X, we conduct experiments on XRETE, a newly introduced benchmark containing 12 widely studied cross-lingual tasks that fully depend on sentence-level representations. Results reveal that EMMA-X achieves state-of-the-art performance.
Researcher Affiliation | Collaboration | Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China; Machine Intelligence Technology Lab, Alibaba DAMO Academy, Hangzhou, China
Pseudocode | Yes | For a clearer presentation, an algorithm of EMMA-X is shown in Algorithm 1.
Open Source Code | Yes | Codes and datasets of the XRETE benchmark: https://github.com/guopingiie/EMMA-X
Open Datasets | Yes | We collect parallel corpora from CCAligned [El-Kishky et al., 2020], CCMatrix [Schwenk et al., 2021], WMT [Akhbardeh et al., 2021], and Multi UN [Ziemski et al., 2016], involving 94 languages with 3.2 billion sentence pairs. In addition, we add CC-100 [Conneau et al., 2020] as the large-scale monolingual corpus, with about 800 billion sentences covering 94 languages.
Dataset Splits | Yes | Table 9: Overview of XRETE tasks. For tasks that have training and dev sets in other languages, we only report the number of sentences in the English sets. We report the number of test examples per language. Example row (XNLI): Train 392,702; Dev 222-743; Test 738-750.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or cloud computing instances with specifications) used for running the experiments were provided in the paper.
Software Dependencies | No | The paper mentions tools such as the SentencePiece model and optimizers such as Adam with citations, but does not specify the software library versions (e.g., Python, PyTorch, TensorFlow) that would be needed for replication.
Experiment Setup | Yes | The GMM classifier is implemented as a mixture of Gaussians, each component consisting of a prior π ∈ R^1, a mean µ ∈ R^1024, and a standard deviation σ ∈ R^1024, all of which are trainable. We set the total number of semantic ranks to N = 4. We optimize the GMM classifier with Adam (β1 = 0.9, β2 = 0.999) [Kingma and Ba, 2015] using a batch size of 1024 and a learning rate of 3e-5. For the cross-lingual encoder, we apply the same training setting as MoCo [He et al., 2020], with a momentum queue of size K = 256 and a temperature of 0.04. We set the momentum coefficient to 0.999 and use the Adam optimizer with a cosine-decay learning rate peaking at 5e-4.