An Autoencoder Approach to Learning Bilingual Word Representations

Authors: Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, Amrita Saha

NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.
Researcher Affiliation | Collaboration | Indian Institute of Technology Madras, Université de Sherbrooke, IBM Research India
Pseudocode | No | The paper describes the autoencoder architecture with mathematical equations and diagrams (Figure 1), but it does not include a block labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Our word representations and code are available at http://www.sarathchandar.in/crl.html
Open Datasets | Yes | For learning the bilingual embeddings, we used sections of the Europarl corpus [25] which contains roughly 2 million parallel sentences. We considered 3 language pairs. ... As for the labeled document classification data sets, they were extracted from sections of the Reuters RCV1/RCV2 corpora, again for the 3 pairs considered in our experiments.
Dataset Splits | Yes | The other hyperparameters were tuned to each task using a training/validation set split of 80% and 20% and using the performance on the validation set of an averaged perceptron trained on the smaller training set portion (notice that this corresponds to a monolingual classification experiment, since the general assumption is that no labeled data is available in the test set language). (A hedged sketch of this 80/20 split is given after the table.)
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments (e.g., specific GPU models, CPU types, or memory specifications). It mentions speed improvements related to using GPUs in general terms, but not for its own setup.
Software Dependencies | No | The paper mentions using 'NLTK [26]' for tokenization but does not provide specific version numbers for NLTK or any other key software dependencies or libraries used in their implementation.
Experiment Setup | Yes | Models were trained for up to 20 epochs using the same data as described earlier. BAE-cr used mini-batch (of size 20) stochastic gradient descent, while BAE-tr used regular stochastic gradient. All results are for word embeddings of size D = 40, as in Klementiev et al. [9]. Further, to speed up the training for BAE-cr we merged each 5 adjacent sentence pairs into a single training instance, as described in Section 2.1. For all language pairs, the joint reconstruction β was fixed to 1 and the cross-lingual correlation factor λ to 4 for BAE-cr. (A configuration sketch follows the table.)
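To make the Experiment Setup row concrete, here is a minimal configuration sketch. The hyperparameter values (D = 40, up to 20 epochs, mini-batch size 20, β = 1, λ = 4, merging 5 adjacent sentence pairs) come from the quoted text; the `merge_adjacent` helper and the toy bag-of-words data are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row (BAE-cr).
CONFIG = {
    "embedding_dim": 40,   # D = 40, as in Klementiev et al. [9]
    "max_epochs": 20,      # trained for up to 20 epochs
    "batch_size": 20,      # mini-batch SGD with batches of size 20
    "merge_window": 5,     # merge 5 adjacent sentence pairs per training instance
    "beta": 1.0,           # joint reconstruction weight, fixed to 1
    "lambda_corr": 4.0,    # cross-lingual correlation factor, fixed to 4
}

def merge_adjacent(bow, window=5):
    """Sum each `window` consecutive bag-of-words rows into a single
    training instance (the BAE-cr speed-up quoted above); hypothetical helper."""
    usable = (bow.shape[0] // window) * window
    return bow[:usable].reshape(-1, window, bow.shape[1]).sum(axis=1)

# Toy usage: 100 "sentences" over a 500-word vocabulary -> 20 merged instances.
rng = np.random.RandomState(0)
sentences = (rng.rand(100, 500) < 0.02).astype(np.float32)
merged = merge_adjacent(sentences, CONFIG["merge_window"])
print(merged.shape)  # (20, 500)
```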
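The Dataset Splits row can be illustrated the same way. Below is a hedged sketch of an 80%/20% training/validation split with an averaged-perceptron-style classifier standing in for the paper's averaged perceptron; the use of scikit-learn, the `SGDClassifier(loss="perceptron", average=True)` stand-in, and the random toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in data: 1000 documents as 40-dimensional vectors (D = 40) with
# 4 topic labels; the real features would be document representations built
# from the learned bilingual word embeddings.
rng = np.random.RandomState(0)
X = rng.randn(1000, 40)
y = rng.randint(0, 4, size=1000)

# 80% training / 20% validation split, as quoted in the Dataset Splits row.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Averaged-perceptron-style classifier: SGD with perceptron loss and weight
# averaging, used here only as a stand-in for the paper's averaged perceptron.
clf = SGDClassifier(loss="perceptron", learning_rate="constant", eta0=1.0,
                    average=True, max_iter=50, tol=None)
clf.fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```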