BilBOWA: Fast Bilingual Distributed Representations without Word Alignments

Authors: Stephan Gouws, Yoshua Bengio, Greg Corrado

ICML 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data." "we experimentally evaluate the induced cross-lingual embeddings on a document-classification (Section 5.1) and lexical translation task (Section 5.2), where the method outperforms current state-of-the-art methods, with training time reduced to minutes or hours compared to several days for prior approaches"
Researcher Affiliation | Collaboration | Stephan Gouws (SGOUWS@GOOGLE.COM), Google Inc., Mountain View, CA, USA; Yoshua Bengio, Dept. IRO, Université de Montréal, QC, Canada & Canadian Institute for Advanced Research; Greg Corrado, Google Inc., Mountain View, CA, USA
Pseudocode | No | The paper describes the model and training process textually and with equations, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "finally, we make available our efficient C implementation to hopefully stimulate further research on cross-lingual distributed feature learning" (https://github.com/gouwsmeister/bilbowa)
Open Datasets | Yes | "For monolingual training data, we use the freely available, pretokenized Wikipedia datasets (Al-Rfou et al., 2013). For cross-lingual training we use the freely-available Europarl v7 corpus (Koehn, 2005)."
Dataset Splits | Yes | For the classification experiments, 15,000 documents (for each language) were randomly selected from the RCV1/2 corpus, with one third (5,000) used as the test set and the remainder divided into training sets of sizes between 100 and 10,000; a separate, held-out validation set of 1,000 documents was used during model development (see the split sketch after the table).
Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments (e.g., CPU/GPU models, memory).
Software Dependencies | No | The paper states: "We implemented our model in C by building on the popular open-source word2vec toolkit", but it does not specify a version of the word2vec toolkit or any other software dependencies.
Experiment Setup | Yes | "Embedding matrices were initialized by drawing from a zero mean, unit-variance Gaussian distribution." The learning rate was set to 0.1 with linear decay, individual updates were clipped to [-0.1, 0.1] per thread, and k = 15 negative samples were used, "which has been shown to give good results".
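
The "Dataset Splits" row is concrete enough to express in code. Below is a minimal Python sketch of the document split, assuming a list of per-language RCV1/RCV2 document IDs is available; the function name, the intermediate training-set sizes, and the provenance of the validation documents are assumptions, since the paper does not release a split script.

```python
import random

def make_rcv_splits(all_doc_ids, seed=0):
    """Mirror the split in the 'Dataset Splits' row: 15,000 documents per
    language, one third (5,000) as the test set, the remainder as nested
    training sets, plus a separate 1,000-document validation set."""
    rng = random.Random(seed)
    selected = rng.sample(all_doc_ids, 15000)
    test = selected[:5000]
    train_pool = selected[5000:]  # 10,000 documents remain for training
    # Intermediate sizes between the stated 100 and 10,000 endpoints are
    # an assumption; the report only gives the range.
    train_sets = {n: train_pool[:n] for n in (100, 500, 1000, 5000, 10000)}
    # Drawing the held-out validation documents from outside the 15,000
    # sample is an assumption about their exact provenance.
    selected_set = set(selected)
    remaining = [d for d in all_doc_ids if d not in selected_set]
    valid = rng.sample(remaining, 1000)
    return train_sets, valid, test
```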
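The "Experiment Setup" row lists concrete hyperparameters. The following NumPy sketch shows how they could be wired into a single update step; it is illustrative only, not the authors' multi-threaded C implementation, and the function names and the learning-rate decay floor are assumptions.

```python
import numpy as np

# Hyperparameters quoted in the 'Experiment Setup' row.
LR0 = 0.1     # initial learning rate, decayed linearly during training
CLIP = 0.1    # individual updates clipped to [-0.1, 0.1]
K_NEG = 15    # k = 15 noise samples for negative sampling

def init_embeddings(vocab_size, dim, rng=None):
    """Zero-mean, unit-variance Gaussian initialisation of an embedding matrix."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.standard_normal((vocab_size, dim))

def apply_update(param, grad, step, total_steps):
    """One SGD step with linearly decayed learning rate and update clipping."""
    lr = LR0 * max(1.0 - step / total_steps, 1e-4)  # decay floor is an assumption
    update = np.clip(lr * grad, -CLIP, CLIP)        # clip the update, not the raw gradient
    param -= update
    return param
```

Clipping the scaled update rather than the raw gradient follows the wording "clipping individual updates to [-0.1, 0.1]".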