An Autoencoder Approach to Learning Bilingual Word Representations

Authors: Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, Amrita Saha

NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.
Researcher Affiliation | Collaboration | Indian Institute of Technology Madras, Université de Sherbrooke, IBM Research India
Pseudocode | No | The paper describes the autoencoder architecture with mathematical equations and diagrams (Figure 1), but it does not include a block labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Our word representations and code are available at http://www.sarathchandar.in/crl.html
Open Datasets | Yes | For learning the bilingual embeddings, we used sections of the Europarl corpus [25] which contains roughly 2 million parallel sentences. We considered 3 language pairs. ... As for the labeled document classification data sets, they were extracted from sections of the Reuters RCV1/RCV2 corpora, again for the 3 pairs considered in our experiments.
Dataset Splits | Yes | The other hyperparameters were tuned to each task using a training/validation set split of 80% and 20% and using the performance on the validation set of an averaged perceptron trained on the smaller training set portion (notice that this corresponds to a monolingual classification experiment, since the general assumption is that no labeled data is available in the test set language). (A hedged sketch of this 80/20 split is given after the table.)
Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments (e.g., specific GPU models, CPU types, or memory specifications). It mentions speed improvements related to using GPUs in general terms, but not for its own setup.
Software Dependencies | No | The paper mentions using 'NLTK [26]' for tokenization but does not provide specific version numbers for NLTK or any other key software dependencies or libraries used in their implementation.
Experiment Setup | Yes | Models were trained for up to 20 epochs using the same data as described earlier. BAE-cr used mini-batch (of size 20) stochastic gradient descent, while BAE-tr used regular stochastic gradient. All results are for word embeddings of size D = 40, as in Klementiev et al. [9]. Further, to speed up the training for BAE-cr we merged each 5 adjacent sentence pairs into a single training instance, as described in Section 2.1. For all language pairs, the joint reconstruction β was fixed to 1 and the cross-lingual correlation factor λ to 4 for BAE-cr. (A configuration sketch follows the table.)
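To make the Experiment Setup row concrete, here is a minimal configuration sketch. The hyperparameter values (D = 40, up to 20 epochs, mini-batch size 20, β = 1, λ = 4, merging 5 adjacent sentence pairs) come from the quoted text; the `merge_adjacent` helper and the toy bag-of-words data are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row (BAE-cr).
CONFIG = {
    "embedding_dim": 40,   # D = 40, as in Klementiev et al. [9]
    "max_epochs": 20,      # trained for up to 20 epochs
    "batch_size": 20,      # mini-batch SGD with batches of size 20
    "merge_window": 5,     # merge 5 adjacent sentence pairs per training instance
    "beta": 1.0,           # joint reconstruction weight, fixed to 1
    "lambda_corr": 4.0,    # cross-lingual correlation factor, fixed to 4
}

def merge_adjacent(bow, window=5):
    """Sum each `window` consecutive bag-of-words rows into a single
    training instance (the BAE-cr speed-up quoted above); hypothetical helper."""
    usable = (bow.shape[0] // window) * window
    return bow[:usable].reshape(-1, window, bow.shape[1]).sum(axis=1)

# Toy usage: 100 "sentences" over a 500-word vocabulary -> 20 merged instances.
rng = np.random.RandomState(0)
sentences = (rng.rand(100, 500) < 0.02).astype(np.float32)
merged = merge_adjacent(sentences, CONFIG["merge_window"])
print(merged.shape)  # (20, 500)
```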
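The Dataset Splits row can be illustrated the same way. Below is a hedged sketch of an 80%/20% training/validation split with an averaged-perceptron-style classifier standing in for the paper's averaged perceptron; the use of scikit-learn, the `SGDClassifier(loss="perceptron", average=True)` stand-in, and the random toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy stand-in data: 1000 documents as 40-dimensional vectors (D = 40) with
# 4 topic labels; the real features would be document representations built
# from the learned bilingual word embeddings.
rng = np.random.RandomState(0)
X = rng.randn(1000, 40)
y = rng.randint(0, 4, size=1000)

# 80% training / 20% validation split, as quoted in the Dataset Splits row.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Averaged-perceptron-style classifier: SGD with perceptron loss and weight
# averaging, used here only as a stand-in for the paper's averaged perceptron.
clf = SGDClassifier(loss="perceptron", learning_rate="constant", eta0=1.0,
                    average=True, max_iter=50, tol=None)
clf.fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```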