BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

Authors: Stephan Gouws, Yoshua Bengio, Greg Corrado

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data." "we experimentally evaluate the induced cross-lingual embeddings on a document-classification (Section 5.1) and lexical translation task (Section 5.2), where the method outperforms current state-of-the-art methods, with training time reduced to minutes or hours compared to several days for prior approaches"
Researcher Affiliation | Collaboration | Stephan Gouws (SGOUWS@GOOGLE.COM), Google Inc., Mountain View, CA, USA; Yoshua Bengio, Dept. IRO, Université de Montréal, QC, Canada & Canadian Institute for Advanced Research; Greg Corrado, Google Inc., Mountain View, CA, USA
Pseudocode | No | The paper describes the model and training process textually and with equations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "finally, we make available our efficient C implementation to hopefully stimulate further research on cross-lingual distributed feature learning" (https://github.com/gouwsmeister/bilbowa)
Open Datasets | Yes | "For monolingual training data, we use the freely available, pretokenized Wikipedia datasets (Al-Rfou et al., 2013). For cross-lingual training we use the freely-available Europarl v7 corpus (Koehn, 2005)."
Dataset Splits | Yes | For the classification experiments, 15,000 documents (for each language) were randomly selected from the RCV1/2 corpus, with one third (5,000) used as the test set and the remainder divided into training sets of sizes between 100 and 10,000; a separate, held-out validation set of 1,000 documents was used during model development (see the split sketch after the table).
Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments (e.g., CPU/GPU models, memory).
Software Dependencies | No | The paper states: "We implemented our model in C by building on the popular open-source word2vec toolkit", but it does not specify a version of the word2vec toolkit or any other software dependencies.
Experiment Setup | Yes | "Embedding matrices were initialized by drawing from a zero mean, unit-variance Gaussian distribution." The learning rate was set to 0.1 with linear decay, individual updates were clipped to [-0.1, 0.1] per thread, and k = 15 negative samples were used, "which has been shown to give good results".
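
The "Dataset Splits" row is concrete enough to express in code. Below is a minimal Python sketch of the document split, assuming a list of per-language RCV1/RCV2 document IDs is available; the function name, the intermediate training-set sizes, and the provenance of the validation documents are assumptions, since the paper does not release a split script.

```python
import random

def make_rcv_splits(all_doc_ids, seed=0):
    """Mirror the split in the 'Dataset Splits' row: 15,000 documents per
    language, one third (5,000) as the test set, the remainder as nested
    training sets, plus a separate 1,000-document validation set."""
    rng = random.Random(seed)
    selected = rng.sample(all_doc_ids, 15000)
    test = selected[:5000]
    train_pool = selected[5000:]  # 10,000 documents remain for training
    # Intermediate sizes between the stated 100 and 10,000 endpoints are
    # an assumption; the report only gives the range.
    train_sets = {n: train_pool[:n] for n in (100, 500, 1000, 5000, 10000)}
    # Drawing the held-out validation documents from outside the 15,000
    # sample is an assumption about their exact provenance.
    selected_set = set(selected)
    remaining = [d for d in all_doc_ids if d not in selected_set]
    valid = rng.sample(remaining, 1000)
    return train_sets, valid, test
```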
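The "Experiment Setup" row lists concrete hyperparameters. The following NumPy sketch shows how they could be wired into a single update step; it is illustrative only, not the authors' multi-threaded C implementation, and the function names and the learning-rate decay floor are assumptions.

```python
import numpy as np

# Hyperparameters quoted in the 'Experiment Setup' row.
LR0 = 0.1     # initial learning rate, decayed linearly during training
CLIP = 0.1    # individual updates clipped to [-0.1, 0.1]
K_NEG = 15    # k = 15 noise samples for negative sampling

def init_embeddings(vocab_size, dim, rng=None):
    """Zero-mean, unit-variance Gaussian initialisation of an embedding matrix."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.standard_normal((vocab_size, dim))

def apply_update(param, grad, step, total_steps):
    """One SGD step with linearly decayed learning rate and update clipping."""
    lr = LR0 * max(1.0 - step / total_steps, 1e-4)  # decay floor is an assumption
    update = np.clip(lr * grad, -CLIP, CLIP)        # clip the update, not the raw gradient
    param -= update
    return param
```

Clipping the scaled update rather than the raw gradient follows the wording "clipping individual updates to [-0.1, 0.1]".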