reproducibilityindex.ai

Training Temporal Word Embeddings with a Compass

Authors: Valerio Di Carlo, Federico Bianchi, Matteo Palmonari (pp. 6326-6334)

AAAI 2019

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments conducted using state-of-the-art datasets and methodologies suggest that our approach outperforms or equals comparable approaches while being more robust in terms of the required corpus size.
Researcher Affiliation	Collaboration	(1) BUP Solutions, Rome, Italy; (2) University of Milan-Bicocca, Milan, Italy
Pseudocode	No	The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our experiments can be easily replicated using the source code available online [1]. [1] https://github.com/valedica/twec
Open Datasets	Yes	The small dataset (Yao et al. 2018) is freely available online [2]. We will refer to this dataset as News Article Corpus Small (NAC-S). The big dataset is the New York Times Annotated Corpus [3] (Sandhaus 2008) employed by Szymanski; Zhang et al. to test their TWEMs. [2] https://sites.google.com/site/zijunyaorutgers/publications [3] https://catalog.ldc.upenn.edu/ldc2008t19
Dataset Splits	Yes	MLPC is made available online (Rudolph and Blei 2018) by Rudolph and Blei: the text is already preprocessed, sub-sampled (|V| = 5,000) and split into training, validation and testing (80%, 10%, 10%);
Hardware Specification	No	The paper mentions "DBE takes almost 6 hours to train on NAC-S on a 16-core CPU setting." but does not provide specific CPU models, GPU models, or other detailed hardware specifications.
Software Dependencies	No	The paper mentions using "gensim library" and "tensorflow" but does not specify any version numbers for these software dependencies, making it difficult to reproduce the exact software environment.
Experiment Setup	Yes	The hyper-parameters reflect those of Yao et al.: small embeddings of size 50, a window of 5 words, 5 negative samples and a small vocabulary of 21k words with at least 200 occurrences over the entire corpus. The settings parameters are similar to those of Szymanski: longer embeddings of size 100, a window size of 5, 5 negative samples and a very large vocabulary of almost 200k words with at least 5 occurrences over the entire corpus. learning rate η = 0.0025, window of size 1, embeddings of size 50 and 10 iterations (5 static and 5 dynamic for TWEC, 1 static and 9 dynamic for DBE as suggested by Rudolph and Blei).
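
The three settings quoted in the Experiment Setup row can be written down as a small configuration sketch, which may help when trying to re-run the experiments. This is only an illustration: the class `EmbeddingSetup`, the dictionary keys, and the helper `to_word2vec_kwargs` are hypothetical names, and mapping the values onto gensim 4.x `Word2Vec` keyword arguments (`vector_size`, `window`, `negative`, `min_count`, `alpha`, `epochs`) is an assumption, since the paper does not state which library version or argument names were used. Values the excerpt does not give are left unset rather than guessed.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class EmbeddingSetup:
    """One hyper-parameter setting quoted in the Experiment Setup row."""
    embedding_size: int
    window: int
    negative_samples: Optional[int] = None  # not stated for the third setting
    min_count: Optional[int] = None         # minimum occurrences over the corpus
    learning_rate: Optional[float] = None   # only stated for the third setting
    iterations: Optional[int] = None        # only stated for the third setting


# Hypothetical labels; the values are the ones quoted in the table row.
SETUPS = {
    "yao_style": EmbeddingSetup(embedding_size=50, window=5,
                                negative_samples=5, min_count=200),
    "szymanski_style": EmbeddingSetup(embedding_size=100, window=5,
                                      negative_samples=5, min_count=5),
    "rudolph_blei_style": EmbeddingSetup(embedding_size=50, window=1,
                                         learning_rate=0.0025, iterations=10),
}


def to_word2vec_kwargs(setup: EmbeddingSetup) -> dict:
    """Translate a setup into gensim 4.x Word2Vec keyword names (assumed mapping).

    Fields left as None are dropped so the library defaults would apply.
    """
    mapping = {
        "embedding_size": "vector_size",
        "window": "window",
        "negative_samples": "negative",
        "min_count": "min_count",
        "learning_rate": "alpha",
        "iterations": "epochs",
    }
    return {mapping[k]: v for k, v in asdict(setup).items() if v is not None}


if __name__ == "__main__":
    for name, setup in SETUPS.items():
        print(name, to_word2vec_kwargs(setup))
```

The resulting dictionaries could be passed as `Word2Vec(sentences, **kwargs)` if gensim 4.x is installed. The split of the 10 iterations into static and dynamic phases (5 and 5 for TWEC, 1 and 9 for DBE) is specific to those methods and is not modelled here.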