Training Temporal Word Embeddings with a Compass

Authors: Valerio Di Carlo, Federico Bianchi, Matteo Palmonari (pp. 6326-6334)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments conducted using state-of-the-art datasets and methodologies suggest that our approach outperforms or equals comparable approaches while being more robust in terms of the required corpus size.
Researcher Affiliation | Collaboration | (1) BUP Solutions, Rome, Italy; (2) University of Milan-Bicocca, Milan, Italy
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our experiments can be easily replicated using the source code available online (footnote 1: https://github.com/valedica/twec). A conceptual sketch of the compass training scheme is given after this table.
Open Datasets | Yes | The small dataset (Yao et al. 2018) is freely available online (footnote 2: https://sites.google.com/site/zijunyaorutgers/publications). We will refer to this dataset as News Article Corpus Small (NAC-S). The big dataset is the New York Times Annotated Corpus (Sandhaus 2008), available online (footnote 3: https://catalog.ldc.upenn.edu/ldc2008t19), employed by Szymanski and by Zhang et al. to test their TWEMs.
Dataset Splits | Yes | MLPC is made available online by Rudolph and Blei (Rudolph and Blei 2018): the text is already preprocessed, sub-sampled (|V| = 5,000) and split into training, validation and testing sets (80%, 10%, 10%).
Hardware Specification | No | The paper mentions that "DBE takes almost 6 hours to train on NAC-S on a 16-core CPU setting" but does not provide specific CPU models, GPU models, or other detailed hardware specifications.
Software Dependencies | No | The paper mentions using the "gensim library" and "tensorflow" but does not specify version numbers for these dependencies, making it difficult to reproduce the exact software environment. A minimal way to record the versions actually used is sketched after this table.
Experiment Setup | Yes | The hyper-parameters reflect those of Yao et al.: small embeddings of size 50, a window of 5 words, 5 negative samples and a small vocabulary of 21k words with at least 200 occurrences over the entire corpus. The setting parameters are similar to those of Szymanski: longer embeddings of size 100, a window size of 5, 5 negative samples and a very large vocabulary of almost 200k words with at least 5 occurrences over the entire corpus. For the comparison with DBE: learning rate η = 0.0025, window of size 1, embeddings of size 50 and 10 iterations (5 static and 5 dynamic for TWEC, 1 static and 9 dynamic for DBE, as suggested by Rudolph and Blei). An illustrative gensim configuration for the first setting follows the table.
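
For orientation, here is a minimal, self-contained NumPy sketch of the compass idea behind the released code (https://github.com/valedica/twec): per-slice target embeddings are trained with CBOW-style negative sampling while a shared context matrix, the "compass", stays frozen. This is not the authors' implementation; the function name, variable names and the uniform negative-sampling scheme are illustrative assumptions.

```python
# Illustrative sketch only, not the authors' code: trains per-slice target
# embeddings U against a frozen, shared context matrix C (the "compass").
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_slice(examples, C, neg=5, lr=0.025, epochs=5, seed=0):
    """examples: list of (context_word_ids, target_word_id) pairs for one time slice.
    C: (vocab_size, dim) context matrix trained once on the whole corpus, never updated here."""
    rng = np.random.default_rng(seed)
    vocab_size, dim = C.shape
    U = (rng.random((vocab_size, dim)) - 0.5) / dim      # per-slice target embeddings
    for _ in range(epochs):
        for ctx_ids, target in examples:
            h = C[ctx_ids].mean(axis=0)                  # averaged (frozen) context vector
            # one positive sample plus `neg` uniformly drawn negatives (simplified)
            samples = np.concatenate(([target], rng.integers(0, vocab_size, size=neg)))
            labels = np.zeros(len(samples))
            labels[0] = 1.0
            scores = sigmoid(U[samples] @ h)              # CBOW with negative sampling
            U[samples] -= lr * np.outer(scores - labels, h)  # update targets only; C untouched
    return U
```

Because every slice is trained against the same frozen C, the resulting per-slice U matrices live in a common space and can be compared across time without any post-hoc alignment, which is the core idea of the compass approach.

Since the paper names gensim and tensorflow without versions, anyone re-running the experiments may want to record the versions actually installed; a minimal check (assuming the standard PyPI package names) could look like:

```python
# Record the versions of the unpinned dependencies mentioned in the paper.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("gensim", "tensorflow"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```

The first (Yao et al.-style) hyper-parameter set above maps naturally onto a gensim CBOW configuration. The sketch below is a plain static Word2Vec call, not the compass-modified trainer; the `corpus` argument and the gensim >= 4 parameter names (`vector_size` rather than the older `size`) are assumptions.

```python
from gensim.models import Word2Vec

def build_model(corpus):
    """corpus: iterable of tokenized sentences (e.g., NAC-S after preprocessing)."""
    return Word2Vec(
        sentences=corpus,
        vector_size=50,   # "small embeddings of size 50" (use size=50 on gensim 3.x)
        window=5,         # window of 5 words
        negative=5,       # 5 negative samples
        min_count=200,    # vocabulary limited to words with at least 200 occurrences
        sg=0,             # CBOW, the architecture used by the compass method
    )
```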
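The Szymanski-style setting would change only `vector_size` to 100 and `min_count` to 5, and the DBE-comparison setting would use `alpha=0.0025`, `window=1` and `vector_size=50`, per the quoted description above.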
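These sketches are meant only to make the reported settings concrete; the authors' actual training loop (including how the compass is shared across slices) is the one in the linked repository.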