Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026].
BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
Authors: Stephan Gouws, Yoshua Bengio, Greg Corrado
ICML 2015 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that bilingual embeddings learned using the proposed model outperform state-of-the-art methods on a cross-lingual document classification task as well as a lexical translation task on WMT11 data. [...] we experimentally evaluate the induced cross-lingual embeddings on a document-classification (§5.1) and lexical translation task (§5.2), where the method outperforms current state-of-the-art methods, with training time reduced to minutes or hours compared to several days for prior approaches; |
| Researcher Affiliation | Collaboration | Stephan Gouws EMAIL Google Inc., Mountain View, CA, USA; Yoshua Bengio, Dept. IRO, Université de Montréal, QC, Canada & Canadian Institute for Advanced Research; Greg Corrado, Google Inc., Mountain View, CA, USA |
| Pseudocode | No | The paper describes the model and training process textually and with equations, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | finally, we make available our efficient C implementation [1] to hopefully stimulate further research on cross-lingual distributed feature learning. [1] https://github.com/gouwsmeister/bilbowa |
| Open Datasets | Yes | For monolingual training data, we use the freely available, pretokenized Wikipedia datasets (Al-Rfou et al., 2013). For cross-lingual training we use the freely-available Europarl v7 corpus (Koehn, 2005). |
| Dataset Splits | Yes | For the classification experiments, 15,000 documents (for each language) were randomly selected from the RCV1/2 corpus, with one third (5,000) used as the test set and the remainder divided into training sets of sizes between 100 and 10,000, and a separate, held-out validation set of 1,000 documents used during the development of our models. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running the experiments (e.g., CPU/GPU models, memory). |
| Software Dependencies | No | The paper states: "We implemented our model in C by building on the popular open-source word2vec toolkit.", but it does not specify version numbers for the C toolchain or the word2vec toolkit. |
| Experiment Setup | Yes | Embedding matrices were initialized by drawing from a zero mean, unit-variance Gaussian distribution. The learning rate was set to 0.1, with linear decay. [...] clipping individual updates to [-0.1, 0.1] per thread. [...] we set k = 15 which has been shown to give good results. |
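
The Experiment Setup row can be made concrete with a minimal sketch of the three quoted ingredients: Gaussian initialization, a learning rate of 0.1 with linear decay, and clipping of individual updates to [-0.1, 0.1]. This is an illustration only, not the authors' C implementation. The function names, the decay-to-zero schedule, the choice to clip the learning-rate-scaled step, and the omission of the "per thread" aspect are all assumptions made for this example.

```python
import random


def init_embeddings(vocab_size, dim, seed=0):
    """Draw every entry from a zero-mean, unit-variance Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def linear_decay_lr(step, total_steps, base_lr=0.1):
    """Learning rate starting at base_lr and decaying linearly.

    Assumption: the rate reaches zero at total_steps; the source only
    says "linear decay" and does not give the end point.
    """
    return base_lr * max(0.0, 1.0 - step / total_steps)


def clip_update(update, bound=0.1):
    """Clip each component of an update to [-bound, bound]."""
    return [max(-bound, min(bound, u)) for u in update]


def apply_update(row, grad, lr, bound=0.1):
    """One SGD step on a single embedding row.

    Assumption: clipping is applied to the scaled step (lr * grad),
    not to the raw gradient.
    """
    step = clip_update([lr * g for g in grad], bound)
    return [w - s for w, s in zip(row, step)]
```

A hypothetical call such as `apply_update(row, grad, linear_decay_lr(step, total_steps))` would then perform one clipped update at the current point in the schedule.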