Analysis of Joint Multilingual Sentence Representations and Semantic K-Nearest Neighbor Graphs

Authors: Holger Schwenk, Douwe Kiela, Matthijs Douze

AAAI 2019, pp. 6982-6990 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that our multilingual encoder outperforms previous work on large-scale similarity search: we achieve a precision@1 of 83.3 on the reconstruction of the UN corpus of 11.3M English/French sentences, compared to a P@1 of 48.9 obtained by Guo et al. (2018); we define new quantitative evaluation tasks to analyze the generalization behavior of multilingual sentence embeddings with respect to unseen domains and languages; we show that our system is able to handle zero-shot transfer to several linguistically related languages without using any resources of those languages; the code used in this paper is freely available in the LASER toolkit (Language Agnostic SEntence Representations). (See the precision@1 sketch below the table.)
Researcher Affiliation | Industry | Holger Schwenk, Douwe Kiela, Matthijs Douze, Facebook AI Research, {schwenk,dkiela,matthijs}@fb.com
Pseudocode | No | The paper includes Figure 1, which illustrates the architecture, but does not provide pseudocode or an algorithm block.
Open Source Code | Yes | The code used in this paper is freely available in the LASER toolkit (Language Agnostic SEntence Representations). We also make available the entire k-nn graph over all 566M sentences. https://github.com/facebookresearch/LASER
Open Datasets | Yes | We trained our model on the twenty-one languages of the Europarl corpus (Koehn 2005). These cover several diverse language families: Germanic: English (en), Danish (da), Dutch (nl), German (de) and Swedish (sv); Romance: French (fr), Italian (it), Portuguese (pt), Romanian (ro) and Spanish (es); Slavic: Bulgarian (bg), Czech (cs), Polish (pl), Slovak (sk) and Slovenian (sl); Baltic: Latvian (lv) and Lithuanian (lt); Uralic: Estonian (et), Hungarian (hu) and Finnish (fi); Hellenic: Greek (el)... We use the aligned texts available on the OPUS web site (Tiedemann 2012).
Dataset Splits | No | The paper mentions training on Europarl and evaluating on WMT and Tatoeba test sets, but does not specify a separate validation split or its size/percentage for hyperparameter tuning.
Hardware Specification | No | The calculation of the 566M sentence embeddings took about 100h on GPU (which can be run in parallel by splitting the data), and the creation of the compressed index needed 12h on a multi-threaded CPU. All the distances of the 20-nn graph are calculated in a distributed way on 4 GPUs. This required about 55h of compute time. The paper reports compute times but does not name specific GPU or CPU models. (See the FAISS index sketch below the table.)
Software Dependencies | No | We use byte-pair encoding (BPE) (Sennrich, Haddow, and Birch 2016) with 40k merge operations to learn a joint vocabulary for all the twenty-one languages. We tackle this computational challenge with the highly optimized FAISS library for efficient similarity search and clustering of dense vectors (Johnson, Douze, and Jégou 2017). The paper mentions these tools but does not provide specific version numbers for them. (See the BPE sketch below the table.)
Experiment Setup | Yes | The word embeddings are 384-dimensional and the BiLSTM uses five layers of size 512. Dropout is set to 0.1. The 1024-dimensional sentence embedding is obtained by max-pooling over the BiLSTM outputs. The decoder is a 5-layer LSTM with 2048-dimensional hidden layers, shared for the two target languages, English and Spanish. An additional 32-dimensional embedding layer is used to give the language information to the decoder. We use byte-pair encoding (BPE) with 40k merge operations. (See the encoder sketch below the table.)
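
The precision@1 numbers quoted in the Research Type row measure how often the nearest neighbor of a source sentence in the embedding space is its aligned translation. A minimal sketch of such an evaluation with FAISS, assuming pre-computed, L2-normalized float32 embedding matrices whose rows are aligned translations (the function and variable names are my own, not taken from the LASER code):

```python
import faiss
import numpy as np

def precision_at_1(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """P@1 for translation retrieval: row i of src_emb should retrieve row i of tgt_emb.

    Both matrices are assumed to be float32 and L2-normalized, so inner
    product equals cosine similarity.
    """
    dim = tgt_emb.shape[1]
    index = faiss.IndexFlatIP(dim)         # exact inner-product search
    index.add(tgt_emb)                     # index the target-language side
    _, nearest = index.search(src_emb, 1)  # 1-nearest neighbor per query
    hits = (nearest[:, 0] == np.arange(src_emb.shape[0])).sum()
    return hits / src_emb.shape[0]

# Example with random data just to show the call signature.
rng = np.random.default_rng(0)
a = rng.standard_normal((1000, 1024)).astype("float32")
faiss.normalize_L2(a)
print(precision_at_1(a, a))  # identical sides -> P@1 of 1.0
```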
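The Hardware Specification row mentions a compressed index and a 20-nn graph over the 566M embeddings. A minimal sketch of how such an index could be built and queried with FAISS at toy scale; the factory string, data sizes, shard count, and nprobe value are assumptions, not the configuration used in the paper:

```python
import faiss
import numpy as np

d = 1024  # dimensionality of the sentence embeddings

# Hypothetical compressed index: OPQ rotation + inverted lists + product
# quantization. The paper only says a "compressed index" was built; this
# factory string is an assumption.
index = faiss.index_factory(d, "OPQ64,IVF1024,PQ64")

embeddings = np.random.rand(100_000, d).astype("float32")  # stand-in data
index.train(embeddings)

# In practice the 566M vectors would be embedded and added in shards so
# they never have to sit in memory at once.
for shard in np.array_split(embeddings, 10):
    index.add(shard)

# 20-nn graph: query the index with the indexed vectors themselves.
faiss.extract_index_ivf(index).nprobe = 64   # accuracy/speed trade-off
distances, neighbors = index.search(embeddings[:1000], 20)
print(neighbors.shape)  # (1000, 20)
```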
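The joint 40k-operation BPE vocabulary from the Software Dependencies row could be reproduced with the reference subword-nmt implementation. A minimal sketch, assuming a concatenated multilingual training file all_languages.txt (a hypothetical path):

```python
# pip install subword-nmt
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn a single joint BPE model over the concatenation of all 21 languages.
with open("all_languages.txt", encoding="utf-8") as infile, \
     open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(infile, codes_out, num_symbols=40_000)

# Apply the learned merges to new text.
with open("bpe.codes", encoding="utf-8") as codes_in:
    bpe = BPE(codes_in)
print(bpe.process_line("multilingual sentence representations"))
```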
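The Experiment Setup row describes the encoder side of the architecture. A minimal PyTorch sketch of a BiLSTM encoder with max-pooling that matches the quoted dimensions; the class name, vocabulary size, and wiring details are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiLSTM sentence encoder with max-pooling.

    Hyperparameters follow the quoted setup: 384-dim BPE embeddings,
    5 BiLSTM layers with 512 units per direction, dropout 0.1, and a
    1024-dim sentence embedding obtained by max-pooling over time.
    """

    def __init__(self, vocab_size: int = 40_000 + 4):  # BPE merges + special tokens (assumed)
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 384)
        self.lstm = nn.LSTM(
            input_size=384,
            hidden_size=512,
            num_layers=5,
            dropout=0.1,
            bidirectional=True,
            batch_first=True,
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) of BPE token ids
        hidden, _ = self.lstm(self.embed(tokens))  # (batch, seq_len, 1024)
        sentence_emb, _ = hidden.max(dim=1)        # max-pool over time -> (batch, 1024)
        return sentence_emb

encoder = SentenceEncoder()
print(encoder(torch.randint(0, 40_000, (2, 12))).shape)  # torch.Size([2, 1024])
```

With 512 hidden units per direction, the concatenated forward and backward states give the 1024-dimensional vectors that max-pooling collapses into a single fixed-size sentence embedding.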