Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

Authors: Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-hsuan Sung, Brian Strope, Ray Kurzweil

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present an approach to learn multilingual sentence embeddings using a bi-directional dual-encoder with additive margin softmax. The embeddings are able to achieve state-of-the-art results on the United Nations (UN) parallel corpus retrieval task. In all the languages tested, the system achieves P@1 of 86% or higher. We use pairs retrieved by our approach to train NMT models that achieve similar performance to models trained on gold pairs. We explore simple document-level embeddings constructed by averaging our sentence embeddings. On the UN document-level retrieval task, document embeddings achieve around 97% on P@1 for all experimented language pairs. Lastly, we evaluate the proposed model on the BUCC mining task. (A hedged code sketch of this scoring and document-averaging setup appears after the table.)
Researcher Affiliation | Industry | Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-hsuan Sung, Brian Strope and Ray Kurzweil, Google AI Language, {yinfeiy, gustavoha}@google.com
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their methodology is publicly available.
Open Datasets | Yes | We evaluate the trained models on the United Nations Parallel Corpus reconstruction task, and the BUCC bitext mining task. ... The United Nations Parallel Corpus [Ziemski et al., 2016] contains 86,000 bilingual document pairs in five language pairs: from en to fr, es, ru, ar and zh. ... In this section, we evaluate the proposed models on the BUCC mining task [Zweigenbaum et al., 2018].
Dataset Splits | Yes | For each language pair, we use 90% of the sentence pairs for training and 10% as development set for parameter tuning.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper mentions software components like 'transformer network architecture' and 'Adam optimization algorithm', but it does not specify version numbers for any key software dependencies.
Experiment Setup | Yes | The encoder uses a 3-layer transformer network architecture [Vaswani et al., 2017]. In the transformer layers, we use 8 attention heads, a hidden size of 512, and a filter size of 2048. ... A margin value of 0.3 is used in all experiments. Training uses SGD with a 0.003 learning rate and batch size of 100. The learning rate is reduced to 0.0003 after 33 million steps. Training concludes at 40 million steps. To improve training speed, we follow [Chidambaram et al., 2018] and multiply the gradients of the word and character embeddings by a factor of 25. We follow [Guo et al., 2018] and append 5 additional hard negatives for each example. (These hyperparameters are collected in the configuration sketch below.)
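
To make the description in the Research Type row concrete, the following is a minimal NumPy sketch of a bi-directional dual-encoder objective with additive margin softmax, plus document-level averaging. It assumes the similarity is the dot product of the two encoder outputs and uses only in-batch negatives (the appended hard negatives from the Experiment Setup row are omitted); the function names, the L2 re-normalization of the document embedding, and the absence of a softmax scale factor are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def bidirectional_additive_margin_loss(x, y, margin=0.3):
    """Sketch of a bi-directional dual-encoder ranking loss with an
    additive margin applied to the positive pairs.

    x, y: (batch, dim) source/target sentence embeddings, where x[i] and
    y[i] form a translation pair and every other in-batch sentence serves
    as a negative.
    """
    scores = x @ y.T                          # scores[i, j] = phi(x_i, y_j)
    batch = scores.shape[0]
    # Additive margin: subtract `margin` from the positive (diagonal) scores.
    scores_m = scores - margin * np.eye(batch)

    def softmax_nll(logits):
        # Mean negative log-likelihood of the true pair (the diagonal).
        logits = logits - logits.max(axis=1, keepdims=True)   # stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # "Bi-directional": rank targets given each source and sources given
    # each target, then sum the two terms.
    return softmax_nll(scores_m) + softmax_nll(scores_m.T)

def document_embedding(sentence_embeddings):
    # Document-level embedding as the average of the document's sentence
    # embeddings, per the abstract; the re-normalization is an assumption.
    doc = sentence_embeddings.mean(axis=0)
    return doc / np.linalg.norm(doc)
```

Retrieval (e.g., the reported UN P@1 results) would then score candidates with the same dot-product similarity and take the nearest neighbor.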
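
For convenience, the hyperparameters quoted in the Experiment Setup row are gathered below as a plain Python configuration sketch; the dictionary keys are illustrative names, not identifiers from the paper or any released code.

```python
# Encoder architecture quoted from the Experiment Setup row.
ENCODER_CONFIG = {
    "num_transformer_layers": 3,
    "num_attention_heads": 8,
    "hidden_size": 512,
    "filter_size": 2048,
}

# Training schedule and loss settings quoted from the same row.
TRAINING_CONFIG = {
    "additive_margin": 0.3,
    "optimizer": "SGD",
    "learning_rate": 0.003,
    "learning_rate_after_33m_steps": 0.0003,
    "total_training_steps": 40_000_000,
    "batch_size": 100,
    "embedding_gradient_multiplier": 25,   # word/character embedding gradients
    "appended_hard_negatives_per_example": 5,
}
```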