A Variational Autoencoding Approach for Inducing Cross-lingual Word Embeddings
Authors: Liangchen Wei, Zhi-Hong Deng
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on the task of cross-lingual document classification have shown that our method is effective. |
| Researcher Affiliation | Academia | Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China. Emails: liangchen.wei@pku.edu.cn, zhdeng@cis.pku.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described in this paper. |
| Open Datasets | Yes | As our joint space model utilizes a parallel corpus only, we train the bilingual embeddings for the English-German language pair using the Europarl v7 parallel corpus [Koehn, 2005], and use the induced representations to classify a subset of the English and German sections of the Reuters RCV1/RCV2 multilingual corpora [Lewis et al., 2004] that are assigned to only one of four categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). |
| Dataset Splits | Yes | For the classification experiment, 15,000 documents (for each language) were selected randomly from the RCV1/RCV2 corpus by Klementiev et al. [Klementiev et al., 2012]. One third of the selected documents (5,000) were used as the test set, and a varying number between 100 and 10,000 of the remainder were used as the training set. Another 1,000 documents were kept as a development set for hyper-parameter tuning. (A hypothetical split sketch follows the table.) |
| Hardware Specification | No | The paper mentions that the model is implemented using TensorFlow but does not specify any hardware details, such as CPU or GPU models or memory capacity. |
| Software Dependencies | No | The paper mentions TensorFlow, the Adam optimizer, and dropout and batch normalization, but does not specify version numbers for these software components, which are required for reproducibility. |
| Experiment Setup | Yes | We use 200 units for the LSTM memory cell and 40 units for the latent variable z; consequently, the word embeddings have 40 units. (A minimal encoder sketch with these sizes follows the table.) |
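
The Dataset Splits row quotes fixed sizes (15,000 documents per language; 5,000 test; 1,000 development; training sets of 100 to 10,000 documents). The sketch below is a minimal, hypothetical reconstruction of such a split in Python; the paper does not describe the actual sampling procedure, so the shuffling, the seed, and the `make_splits` helper are illustrative assumptions.

```python
import random

def make_splits(documents, train_size, seed=0):
    """Hypothetical reconstruction of the RCV1/RCV2 split quoted above.

    documents: the 15,000 documents selected for one language.
    train_size: varying training-set size, between 100 and 10,000.
    """
    assert 100 <= train_size <= 10000
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    test = docs[:5000]                    # one third held out as the test set
    dev = docs[5000:6000]                 # 1,000 docs for hyper-parameter tuning
    train = docs[6000:6000 + train_size]  # training set drawn from the remainder
    return train, dev, test
```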
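The Experiment Setup row gives only the unit counts: a 200-unit LSTM encoder and a 40-dimensional latent variable z, which also fixes the word-embedding size. Below is a minimal Keras-style sketch of an encoder with those sizes; the vocabulary size, sequence length, and layer wiring are assumptions rather than the authors' implementation, and the sampling layer is the standard reparameterization trick for a diagonal-Gaussian posterior.

```python
import tensorflow as tf

LSTM_UNITS = 200    # "200 units for LSTM memory cell" (from the paper)
LATENT_DIM = 40     # "40 units for latent variable z" == embedding size
VOCAB_SIZE = 50000  # assumed; not reported in the quoted text
MAX_LEN = 64        # assumed maximum sentence length

tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, LATENT_DIM)(tokens)
hidden = tf.keras.layers.LSTM(LSTM_UNITS)(embedded)

# Diagonal-Gaussian posterior q(z|x): the encoder predicts a mean and a
# log-variance, and z is sampled with the reparameterization trick.
z_mean = tf.keras.layers.Dense(LATENT_DIM)(hidden)
z_log_var = tf.keras.layers.Dense(LATENT_DIM)(hidden)

def sample_z(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = tf.keras.layers.Lambda(sample_z)([z_mean, z_log_var])
encoder = tf.keras.Model(tokens, [z, z_mean, z_log_var])
```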