A Variational Autoencoding Approach for Inducing Cross-lingual Word Embeddings

Authors: Liangchen Wei, Zhi-Hong Deng

IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results on the task of cross-lingual document classification have shown that our method is effective.
Researcher Affiliation | Academia | Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; liangchen.wei@pku.edu.cn, zhdeng@cis.pku.edu.cn
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code-release statement, or code in supplementary materials) for the methodology described in this paper.
Open Datasets | Yes | As our joint-space model utilizes only a parallel corpus, we train the bilingual embeddings for the English-German language pair using the Europarl v7 parallel corpus [Koehn, 2005], and use the induced representations to classify a subset of the English and German sections of the Reuters RCV1/RCV2 multilingual corpora [Lewis et al., 2004] that are assigned to only one of four categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets).
Dataset Splits | Yes | For the classification experiment, 15000 documents (for each language) were selected randomly by Klementiev [Klementiev et al., 2012] from the RCV1/RCV2 corpora. One third of the selected documents (5000) were used as the test set, and a varying number, between 100 and 10000, of the remainder were used as training sets. Another 1000 documents were kept as a development set for hyper-parameter tuning.
Hardware Specification | No | The paper mentions that the model is implemented using TensorFlow but does not specify any hardware details such as CPU or GPU models, or memory.
Software Dependencies | No | The paper mentions 'TensorFlow', 'ADAM', and 'dropout and batch normalization', but does not specify version numbers for these software components, which are required for reproducibility.
Experiment Setup | Yes | We use 200 units for the LSTM memory cell and 40 units for the latent variable z, and consequently 40 units for the word embeddings.
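To make the dimensions quoted in the Experiment Setup row concrete, the sketch below builds a variational LSTM encoder with 200 memory-cell units and a 40-dimensional latent variable z. This is a minimal sketch assuming a TensorFlow/Keras implementation; the class name, vocabulary size, and layer arrangement are assumptions for illustration, not the authors' released code.

```python
# Illustrative sketch of the quoted encoder sizes (200 LSTM units, 40-d latent z),
# assuming TensorFlow/Keras; architecture details beyond the sizes are assumptions.
import tensorflow as tf

LSTM_UNITS = 200   # "200 units for the LSTM memory cell"
LATENT_DIM = 40    # "40 units for the latent variable z" = word-embedding size

class VariationalEncoder(tf.keras.Model):
    """Maps token ids to a 40-dimensional latent code via a 200-unit LSTM."""

    def __init__(self, vocab_size=50000):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, LATENT_DIM)
        self.lstm = tf.keras.layers.LSTM(LSTM_UNITS)
        self.mean = tf.keras.layers.Dense(LATENT_DIM)
        self.log_var = tf.keras.layers.Dense(LATENT_DIM)

    def call(self, tokens):
        h = self.lstm(self.embed(tokens))          # sentence encoding
        z_mean = self.mean(h)                      # variational mean
        z_log_var = self.log_var(h)                # variational log-variance
        # Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I).
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps
```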
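Likewise, the Dataset Splits row can be read as the following split procedure. Only the sizes (15000 documents per language, 5000 test, 1000 development, 100 to 10000 training) come from the quoted setup; the helper name, shuffling, and seed are hypothetical.

```python
# Sketch of the RCV1/RCV2 document split described above; the sizes are from the
# quoted setup, the function itself is a hypothetical reconstruction.
import random

def split_documents(documents, train_size, seed=0):
    """Split 15000 sampled documents (one language) into train/dev/test."""
    assert len(documents) == 15000, "15000 documents are sampled per language"
    assert 100 <= train_size <= 10000, "training sets range from 100 to 10000"
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    test = docs[:5000]           # one third held out as the test set
    dev = docs[5000:6000]        # 1000 documents for hyper-parameter tuning
    train = docs[6000:6000 + train_size]
    return train, dev, test
```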