An Autoencoder Approach to Learning Bilingual Word Representations
Authors: Sarath Chandar A P, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, Amrita Saha
NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline. |
| Researcher Affiliation | Collaboration | ¹Indian Institute of Technology Madras, ²Université de Sherbrooke, ³IBM Research India |
| Pseudocode | No | The paper describes the autoencoder architecture with mathematical equations and diagrams (Figure 1), but it does not include a block labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Our word representations and code are available at http://www.sarathchandar.in/crl.html |
| Open Datasets | Yes | For learning the bilingual embeddings, we used sections of the Europarl corpus [25] which contains roughly 2 million parallel sentences. We considered 3 language pairs. ... As for the labeled document classification data sets, they were extracted from sections of the Reuters RCV1/RCV2 corpora, again for the 3 pairs considered in our experiments. |
| Dataset Splits | Yes | The other hyperparameters were tuned to each task using a training/validation set split of 80% and 20% and using the performance on the validation set of an averaged perceptron trained on the smaller training set portion (notice that this corresponds to a monolingual classification experiment, since the general assumption is that no labeled data is available in the test set language). |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for its experiments (e.g., specific GPU models, CPU types, or memory specifications). It mentions speed improvements related to using GPUs in general terms but not its own setup. |
| Software Dependencies | No | The paper mentions using 'NLTK [26]' for tokenization but does not provide specific version numbers for NLTK or any other key software dependencies or libraries used in their implementation. |
| Experiment Setup | Yes | Models were trained for up to 20 epochs using the same data as described earlier. BAE-cr used mini-batch (of size 20) stochastic gradient descent, while BAE-tr used regular stochastic gradient. All results are for word embeddings of size D = 40, as in Klementiev et al. [9]. Further, to speed up the training for BAE-cr we merged each 5 adjacent sentence pairs into a single training instance, as described in Section 2.1. For all language pairs, the joint reconstruction β was fixed to 1 and the cross-lingual correlation factor λ to 4 for BAE-cr. |
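
To make the hyperparameters in the Experiment Setup row concrete, here is a minimal NumPy sketch of a BAE-cr-style mini-batch objective using the quoted values (D = 40, mini-batch size 20, β = 1, λ = 4). This is not the authors' released code (available at http://www.sarathchandar.in/crl.html); the binary bag-of-words inputs, tied decoder weights, shared hidden bias, toy vocabulary sizes, and the exact form of the correlation term are assumptions made for illustration.

```python
# Illustrative sketch only: a simplified bilingual-autoencoder objective,
# not the paper's exact model or implementation.
import numpy as np

rng = np.random.default_rng(0)

D = 40            # embedding / hidden dimension, as in the paper
BETA = 1.0        # weight on the joint reconstruction term
LAMBDA = 4.0      # weight on the cross-lingual correlation term
BATCH_SIZE = 20   # mini-batch size reported for BAE-cr

VX, VY = 500, 600  # toy vocabulary sizes (assumed)

# Language-specific encoder weights, shared hidden bias, decoder biases (assumed parameterization).
W_x = rng.normal(0.0, 0.01, (D, VX))
W_y = rng.normal(0.0, 0.01, (D, VY))
b = np.zeros(D)
c_x = np.zeros(VX)
c_y = np.zeros(VY)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x=None, y=None):
    """Hidden code from one language, the other, or both (joint view)."""
    pre = b.copy()
    if x is not None:
        pre = pre + W_x @ x
    if y is not None:
        pre = pre + W_y @ y
    return sigmoid(pre)

def recon_loss(a, x, y):
    """Cross-entropy for reconstructing BOTH bag-of-words vectors from code a."""
    x_hat = sigmoid(W_x.T @ a + c_x)   # tied decoder weights (assumption)
    y_hat = sigmoid(W_y.T @ a + c_y)
    def bce(t, p):
        return -np.sum(t * np.log(p + 1e-9) + (1 - t) * np.log(1 - p + 1e-9))
    return bce(x, x_hat) + bce(y, y_hat)

def cross_correlation(A_x, A_y):
    """Sum over hidden units of the sample correlation between the two views."""
    A_x = A_x - A_x.mean(axis=0)
    A_y = A_y - A_y.mean(axis=0)
    num = (A_x * A_y).sum(axis=0)
    den = np.sqrt((A_x ** 2).sum(axis=0) * (A_y ** 2).sum(axis=0)) + 1e-9
    return np.sum(num / den)

def batch_objective(X, Y):
    """Mini-batch loss: per-view and joint reconstruction minus correlation."""
    A_x = np.stack([encode(x=x) for x in X])
    A_y = np.stack([encode(y=y) for y in Y])
    A_xy = np.stack([encode(x=x, y=y) for x, y in zip(X, Y)])
    loss = 0.0
    for i in range(len(X)):
        loss += recon_loss(A_x[i], X[i], Y[i])           # reconstruct both from x
        loss += recon_loss(A_y[i], X[i], Y[i])           # reconstruct both from y
        loss += BETA * recon_loss(A_xy[i], X[i], Y[i])   # joint reconstruction (beta = 1)
    loss -= LAMBDA * cross_correlation(A_x, A_y)         # correlate the two codes (lambda = 4)
    return loss / len(X)

# Toy mini-batch standing in for merged 5-sentence-pair bag-of-words vectors.
X = (rng.random((BATCH_SIZE, VX)) < 0.05).astype(float)
Y = (rng.random((BATCH_SIZE, VY)) < 0.05).astype(float)
print("mini-batch objective:", batch_objective(X, Y))
```

This would be minimized with mini-batch stochastic gradient descent (batch size 20) for up to 20 epochs, per the quoted setup; the gradient computation is omitted here for brevity.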
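
The Dataset Splits row describes tuning via an 80%/20% training/validation split scored with an averaged perceptron trained on the 80% portion. The snippet below is a small, generic sketch of that protocol; the random document vectors, label construction, and epoch count are placeholders, not details taken from the paper.

```python
# Generic sketch of the 80/20 tuning protocol with an averaged perceptron;
# feature vectors and labels here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)

def averaged_perceptron(Xtr, ytr, epochs=10):
    """Binary averaged perceptron; labels in {-1, +1}. Epoch count is assumed."""
    w = np.zeros(Xtr.shape[1])
    w_sum = np.zeros_like(w)
    n = 0
    for _ in range(epochs):
        for x, y in zip(Xtr, ytr):
            if y * (w @ x) <= 0:      # mistake-driven update
                w = w + y * x
            w_sum += w                 # accumulate for averaging
            n += 1
    return w_sum / n

# Toy monolingual "documents": rows are 40-dimensional document vectors.
X = rng.normal(size=(1000, 40))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=1000))

# 80% training / 20% validation split, as described in the paper.
perm = rng.permutation(len(X))
cut = int(0.8 * len(X))
tr, va = perm[:cut], perm[cut:]

w = averaged_perceptron(X[tr], y[tr])
val_acc = np.mean(np.sign(X[va] @ w) == y[va])
print(f"validation accuracy: {val_acc:.3f}")
```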