All Word Embeddings from One Embedding

Authors: Sho Takase, Sosuke Kobayashi

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We indicate our ALONE can be used as word representation sufficiently through an experiment on the reconstruction of pre-trained word embeddings. In addition, we also conduct experiments on NLP application tasks: machine translation and summarization."
Researcher Affiliation | Collaboration | Sho Takase, Tokyo Institute of Technology (sho.takase@nlp.c.titech.ac.jp); Sosuke Kobayashi, Tohoku University / Preferred Networks, Inc. (sosk@preferred.jp)
Pseudocode | No | The paper describes the method using mathematical equations and textual descriptions, but it does not contain a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | "The code is publicly available at https://github.com/takase/alone_seq2seq"
Open Datasets | Yes | "We used the pre-trained 300 dimensional GloVe [22] as source word embeddings and reconstructed them with ALONE. ... We used WMT En-De dataset since it is widely used to evaluate the performance of machine translation [6, 36, 18]. ... We used the DUC 2004 task 1 [20] as the test set."
Dataset Splits | Yes | "Following previous studies [36, 18], we used WMT 2016 training data, which contains 4.5M sentence pairs, newstest2013, and newstest2014 for training, validation, and test respectively."
Hardware Specification | No | The paper does not provide specific hardware details (such as exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions software like PyTorch [21] and fairseq [19], but it does not provide specific version numbers for these or other software components necessary for replication.
Experiment Setup | Yes | "We set mini-batch size 256 and the number of epochs 1000. For c, M, and p_o in the binary mask, we set 64, 8, and 0.5 respectively. We used the same dimension size as GloVe (300) for D_o and conducted experiments with varying D_inter in {600, 1200, 1800, 2400}. ... We set D_o the same number as the dimension of each layer in the Transformer (d_model, i.e., 512) and varied D_inter. For other hyper-parameters, we set as follows: c = 64, M = 8, and p_o = 0.5. Moreover, we applied the dropout after the ReLU activation function in Equation (3)."
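
Since the paper offers equations rather than pseudocode (see the Pseudocode row above), the core computation can be summarized in one line. This is a hedged reconstruction pieced together from the descriptions quoted in this table, not a verbatim equation from the paper: o denotes the single shared source embedding, m_w the word-specific binary mask governed by c, M, and p_o, and the two weight matrices form the feed-forward network whose ReLU is referenced as Equation (3); biases and dropout are omitted.

    e_w = W_2 \, \mathrm{ReLU}\!\left( W_1 \, (o \odot m_w) \right),
    \qquad o \in \mathbb{R}^{D_o},\; m_w \in \{0,1\}^{D_o},\;
    W_1 \in \mathbb{R}^{D_{\mathrm{inter}} \times D_o},\;
    W_2 \in \mathbb{R}^{D_o \times D_{\mathrm{inter}}}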
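
As a concrete illustration of the quoted setup, below is a minimal PyTorch-style sketch that wires these hyper-parameters (D_o = 300 for the GloVe reconstruction, D_inter from {600, 1200, 1800, 2400}, p_o = 0.5, mini-batches of 256, dropout after the ReLU) into a small module. It is not the authors' released implementation (that lives in the linked repository): the class name ALONESketch is hypothetical, the word-specific masks are sampled directly at random instead of being composed from the paper's c candidates and M combinations, and the dropout rate and squared-error reconstruction objective are assumptions made for illustration.

import torch
import torch.nn as nn


class ALONESketch(nn.Module):
    """Minimal sketch: one shared embedding, a per-word binary mask, a ReLU FFN."""

    def __init__(self, vocab_size, d_o=300, d_inter=1200, p_o=0.5, dropout=0.1):
        super().__init__()
        # Single shared source embedding for the whole vocabulary.
        self.source = nn.Parameter(torch.randn(d_o))
        # Word-specific binary masks; the paper composes them from c candidate
        # vectors and M combinations, here they are sampled directly (assumption).
        masks = (torch.rand(vocab_size, d_o) > p_o).float()
        self.register_buffer("masks", masks)
        # Two-layer feed-forward network with ReLU (cf. Equation (3));
        # dropout is applied right after the ReLU, as the quote states.
        self.ffn = nn.Sequential(
            nn.Linear(d_o, d_inter),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_inter, d_o),
        )

    def forward(self, token_ids):
        # Filter the shared embedding with each word's mask, then transform it.
        filtered = self.source * self.masks[token_ids]
        return self.ffn(filtered)


# Reconstruction-style usage (hypothetical vocabulary size and loss choice);
# the quote specifies mini-batches of 256 words and 1000 epochs:
# emb = ALONESketch(vocab_size=50000)
# loss = nn.functional.mse_loss(emb(batch_ids), glove[batch_ids])

For the translation and summarization experiments, the same module would be instantiated with d_o = 512 so that D_o matches the Transformer's d_model, per the quote above.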