All Word Embeddings from One Embedding
Authors: Sho Takase, Sosuke Kobayashi
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We indicate our ALONE can be used as word representation sufficiently through an experiment on the reconstruction of pre-trained word embeddings. In addition, we also conduct experiments on NLP application tasks: machine translation and summarization. |
| Researcher Affiliation | Collaboration | Sho Takase (Tokyo Institute of Technology), sho.takase@nlp.c.titech.ac.jp; Sosuke Kobayashi (Tohoku University / Preferred Networks, Inc.), sosk@preferred.jp |
| Pseudocode | No | The paper describes the method using mathematical equations and textual descriptions, but it does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The code is publicly available at https://github.com/takase/alone_seq2seq |
| Open Datasets | Yes | We used the pre-trained 300 dimensional GloVe [22] as source word embeddings and reconstructed them with ALONE. ... We used WMT En-De dataset since it is widely used to evaluate the performance of machine translation [6, 36, 18]. ... We used the DUC 2004 task 1 [20] as the test set. |
| Dataset Splits | Yes | Following previous studies [36, 18], we used the WMT 2016 training data (4.5M sentence pairs), newstest2013, and newstest2014 for training, validation, and test, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions software like PyTorch [21] and fairseq [19], but it does not provide specific version numbers for these or other software components necessary for replication. |
| Experiment Setup | Yes | We set mini-batch size 256 and the number of epochs 1000. For c, M, and p_o in the binary mask, we set 64, 8, and 0.5 respectively. We used the same dimension size as GloVe (300) for D_o and conducted experiments with varying D_inter in {600, 1200, 1800, 2400}. ... We set D_o the same number as the dimension of each layer in the Transformer (d_model, i.e., 512) and varied D_inter. For other hyper-parameters, we set as follows: c = 64, M = 8, and p_o = 0.5. Moreover, we applied the dropout after the ReLU activation function in Equation (3). (See the sketch below the table.) |
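To make the Experiment Setup row easier to interpret, here is a minimal PyTorch sketch of how the quoted hyperparameters (c = 64, M = 8, p_o = 0.5, D_o, D_inter) fit together in ALONE's construction e_w = FFN(o ⊙ m_w). The mask construction below (element-wise product of M binary vectors drawn from fixed codebooks, with the candidate keep-rate chosen so roughly a p_o fraction of elements is zeroed) is an assumption of this sketch, as are the names `ALONEEmbedding`, `codebooks`, and `assign`; the authoritative implementation is the linked repository (https://github.com/takase/alone_seq2seq).

```python
# Hedged sketch of ALONE's embedding construction, not the authors' exact code.
import torch
import torch.nn as nn


class ALONEEmbedding(nn.Module):
    """Every word embedding is derived from one shared vector o."""

    def __init__(self, vocab_size, d_out=300, d_inter=1200, c=64, M=8, p_o=0.5, dropout=0.1):
        super().__init__()
        # Single shared embedding o used for every word.
        self.shared = nn.Parameter(torch.randn(d_out))
        # M fixed codebooks, each holding c random binary candidate vectors.
        # Assumption: the keep-rate is chosen so the product of M candidates
        # zeroes roughly a p_o fraction of the mask elements.
        keep = (1.0 - p_o) ** (1.0 / M)
        self.register_buffer("codebooks", (torch.rand(M, c, d_out) < keep).float())
        # Each word gets one fixed random candidate index per codebook.
        self.register_buffer("assign", torch.randint(0, c, (vocab_size, M)))
        # Feed-forward network (D_o -> D_inter -> D_o) with ReLU and dropout
        # applied after the activation, as in the quoted setup.
        self.ffn = nn.Sequential(
            nn.Linear(d_out, d_inter),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_inter, d_out),
        )

    def forward(self, word_ids):
        # word_ids: (batch, seq) integer tensor of vocabulary indices.
        idx = self.assign[word_ids]                                  # (batch, seq, M)
        m = torch.arange(self.codebooks.size(0), device=idx.device)  # (M,)
        picked = self.codebooks[m, idx]                              # (batch, seq, M, d_out)
        mask = picked.prod(dim=-2)                                   # (batch, seq, d_out)
        # e_w = FFN(o ⊙ m_w): mask the shared embedding, then transform it.
        return self.ffn(self.shared * mask)


# Example with the Transformer-sized D_o = 512 from the table; the D_inter
# value here is illustrative, since the paper only says D_inter was varied.
emb = ALONEEmbedding(vocab_size=32000, d_out=512, d_inter=2048)
out = emb(torch.randint(0, 32000, (2, 10)))  # -> shape (2, 10, 512)
```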