Training Temporal Word Embeddings with a Compass
Authors: Valerio Di Carlo, Federico Bianchi, Matteo Palmonari (pp. 6326-6334)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted using state-of-the-art datasets and methodologies suggest that our approach outperforms or equals comparable approaches while being more robust in terms of the required corpus size. |
| Researcher Affiliation | Collaboration | BUP Solutions, Rome, Italy; University of Milan-Bicocca, Milan, Italy |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our experiments can be easily replicated using the source code available online at https://github.com/valedica/twec (a usage sketch is given after the table). |
| Open Datasets | Yes | The small dataset (Yao et al. 2018) is freely available online (https://sites.google.com/site/zijunyaorutgers/publications); we will refer to this dataset as News Article Corpus Small (NAC-S). The big dataset is the New York Times Annotated Corpus (Sandhaus 2008), available from https://catalog.ldc.upenn.edu/ldc2008t19, employed by Szymanski and by Zhang et al. to test their TWEMs. |
| Dataset Splits | Yes | MLPC is made available online (Rudolph and Blei 2018) by Rudolph and Blei: the text is already preprocessed, sub-sampled (|V| = 5,000), and split into training, validation, and testing sets (80%, 10%, 10%). |
| Hardware Specification | No | The paper mentions that "DBE takes almost 6 hours to train on NAC-S on a 16-core CPU setting" but does not provide specific CPU models, GPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions using the "gensim library" and "tensorflow" but does not specify version numbers for these dependencies, making the exact software environment difficult to reproduce. |
| Experiment Setup | Yes | For NAC-S, the hyper-parameters reflect those of Yao et al.: embeddings of size 50, a window of 5 words, 5 negative samples, and a small vocabulary of 21k words with at least 200 occurrences over the entire corpus. For NYT, the settings are similar to those of Szymanski: longer embeddings of size 100, a window of size 5, 5 negative samples, and a very large vocabulary of almost 200k words with at least 5 occurrences over the entire corpus. For the comparison with DBE on MLPC: learning rate η = 0.0025, a window of size 1, embeddings of size 50, and 10 iterations (5 static and 5 dynamic for TWEC; 1 static and 9 dynamic for DBE, as suggested by Rudolph and Blei). These settings are restated as a configuration sketch after the table. |
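
As a pointer for replication, below is a minimal usage sketch of the released code. It assumes the `TWEC` class and the `train_compass`/`train_slice` methods described in the repository README; the constructor arguments and file paths are illustrative placeholders rather than values taken from the paper.

```python
# Minimal sketch of training aligned temporal embeddings with the released
# code (https://github.com/valedica/twec). Class and method names follow the
# repository README; the file paths are hypothetical placeholders.
from twec.twec import TWEC

# The compass is trained once on the concatenation of all time slices;
# its context embeddings are then frozen.
aligner = TWEC(size=50, siter=10, diter=10, workers=4)
aligner.train_compass("corpus/compass.txt", overwrite=False)

# Each slice is trained against the frozen compass, so vectors from
# different time periods are directly comparable without post-alignment.
slice_1990 = aligner.train_slice("corpus/nyt_1990.txt", save=True)
slice_2000 = aligner.train_slice("corpus/nyt_2000.txt", save=True)
```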
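
For convenience, the three quoted hyper-parameter settings can also be restated as gensim-style `Word2Vec` keyword arguments, as in the sketch below. The values come from the setup above; mapping them onto gensim ≥ 4 parameter names (`vector_size`, `window`, `negative`, `min_count`, `alpha`, `epochs`) is our assumption, not something the paper specifies.

```python
# The three experiment configurations quoted above, written as gensim >= 4
# Word2Vec keyword arguments, e.g. Word2Vec(sentences, **NAC_S_SETUP).
# Values are from the paper's setup; the parameter-name mapping is assumed.
NAC_S_SETUP = dict(  # follows Yao et al.
    vector_size=50, window=5, negative=5, min_count=200,
)
NYT_SETUP = dict(    # follows Szymanski
    vector_size=100, window=5, negative=5, min_count=5,
)
MLPC_SETUP = dict(   # comparison with DBE (Rudolph and Blei)
    vector_size=50, window=1, alpha=0.0025, epochs=10,
)
# On MLPC the 10 iterations are split: 5 static + 5 dynamic for TWEC,
# and 1 static + 9 dynamic for DBE, as suggested by Rudolph and Blei.
```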