Emu: Enhancing Multilingual Sentence Embeddings with Semantic Specialization

Authors: Wataru Hirota, Yoshihiko Suhara, Behzad Golshan, Wang-Chiew Tan

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'Our experimental results based on several language pairs show that our specialized embeddings outperform the state-of-the-art multilingual sentence embedding model on the task of cross-lingual intent classification using only monolingual labeled data.'
Researcher Affiliation | Collaboration | Osaka University, Megagon Labs; w-hirota@ist.osaka-u.ac.jp, {yoshi, behzad, wangchiew}@megagon.ai
Pseudocode | Yes | 'Algorithm 1 shows a single training step of EMU.'
Open Source Code | Yes | 'Our code is available at https://github.com/megagonlabs/emu.'
Open Datasets | Yes | 'ATIS (Hemphill, Godfrey, and Doddington 1990) is a publicly available corpus for spoken dialog systems and is widely used for intent classification research. ... Quora is a publicly available paraphrase detection dataset that contains over 400k questions with duplicate labels.'
Dataset Splits | No | The paper mentions splitting data into training and test sets but does not explicitly describe a separate validation split or its size/proportion. It states, 'We split the dataset into training and test sets so that the sentences used for fine-tuning do not appear in the test set.'
Hardware Specification | No | The paper does not specify the hardware used for the experiments (e.g., specific CPU/GPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions PyTorch and using 'the official implementation of LASER' but does not provide specific version numbers for PyTorch or other software dependencies.
Experiment Setup | Yes | 'We used an initial learning rate of 10^-3 and optimized the model with Adam. We used a batch size of 16. For our proposed methods, we set α = 50 and λ = 10^-4. All the models were trained for 3 epochs. The architecture of language discriminator D has two 900-dimensional fully-connected layers with a dropout rate of 0.2. The hyperparameters were γ = 10^-4, k = 5, c = 0.01, respectively. The language discriminator was also optimized with Adam with an initial learning rate of 5.0 × 10^-4.'
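
To make the quoted experiment setup easier to check against the released code, here is a minimal PyTorch sketch of the reported configuration (the paper names PyTorch but gives no version). Only the layer sizes, dropout rate, optimizer choice, learning rates, batch size, epoch count, and the values of α, λ, γ, k, and c come from the quote; the ReLU activations, the discriminator's single-unit output head, the 1024-dimensional LASER input size, the placeholder encoder module, and the readings of k (discriminator steps per encoder step) and c (a clipping constant) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

EMB_DIM = 1024  # LASER sentence embeddings are 1024-dimensional

# Language discriminator D: two 900-dimensional fully-connected layers
# with a dropout rate of 0.2, as reported. The ReLU activations and the
# single-unit output head are assumptions; the paper does not detail them.
discriminator = nn.Sequential(
    nn.Linear(EMB_DIM, 900),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(900, 900),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(900, 1),
)

# Hyperparameter values quoted from the paper. The roles of K and C are
# assumed readings (discriminator steps per encoder step, and a clipping
# constant), not stated interpretations from the quote.
ALPHA = 50
LAMBDA = 1e-4
GAMMA = 1e-4
K = 5
C = 0.01
BATCH_SIZE = 16
EPOCHS = 3

# Placeholder for the specialized sentence encoder; the real model is
# initialized from LASER rather than a single linear layer.
encoder = nn.Linear(EMB_DIM, EMB_DIM)

# Adam with the reported initial learning rates: 10^-3 for the model,
# 5.0 x 10^-4 for the language discriminator.
encoder_optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
discriminator_optimizer = torch.optim.Adam(discriminator.parameters(), lr=5e-4)
```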