Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Eigenwords: Spectral Word Embeddings
Authors: Paramveer S. Dhillon, Dean P. Foster, Lyle H. Ungar
JMLR 2015 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also perform thorough qualitative and quantitative evaluation of Eigenwords, showing that simple linear approaches give performance comparable to or superior to state-of-the-art non-linear deep-learning-based methods. |
| Researcher Affiliation | Academia | Paramveer S. Dhillon (EMAIL), Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02142, USA; Dean P. Foster (EMAIL), Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA; Lyle H. Ungar (EMAIL), Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA |
| Pseudocode | Yes | Algorithm 1: Two-step CCA. Algorithm 2: LR-MVL Algorithm, Learning from Large Amounts of Unlabeled Data (no exponential smooths). Algorithm 3: LR-MVL Algorithm, Learning from Large Amounts of Unlabeled Data (with exponential smooths). Algorithm 4: Inducing Context-Specific Embeddings for Train/Dev/Test Data. Algorithm 5: Randomized singular value decomposition. |
| Open Source Code | Yes | Eigenwords: Reuters RCV1 corpus, uncleaned and case intact, context window h=2, vocabulary of 100,000 words, k=200 dimensions. Code and embeddings available. |
| Open Datasets | Yes | In the results presented below (qualitative and quantitative), we trained all the algorithms (including eigenwords) on the Reuters RCV1 corpus (Rose et al., 2002) for uniformity of comparison. Case was left intact and we did not do any other cleaning of data. Tokenization was performed using the NLTK tokenizer (Bird and Loper, 2004). For this task, it is interesting to see how well the cosine similarity between the word embeddings correlates with the human judgment of similarity between the same two words. The results in Table 4 show the Spearman's correlation between the cosine similarity of the respective word embeddings and the human judgments. A standard data set for evaluating vector-space models is the WordSim-353 data set (Finkelstein et al., 2001), which consists of 353 pairs of nouns. Table 2 provides statistics on all the corpora used, namely: the Wall Street Journal portion of the Penn Treebank (Marcus et al., 1993) (we consider the 17 tags of PTB17 (Smith and Eisner, 2005)), the Bosque subset of the Portuguese Floresta Sintá(c)tica Treebank (Afonso et al., 2002), the Bulgarian BulTreeBank (Simov et al., 2002) (with only the 12 coarse tags), and the Danish Dependency Treebank (DDT) (Kromann, 2003). For the NER experiments we used the data from the CoNLL 2003 shared task and for chunking experiments we used the CoNLL 2000 shared task data with standard training, development and testing set splits. We focus on the SemEval 2013 cross-lingual WSD task (Lefever and Hoste, 2013), for which 20 English nouns were chosen for disambiguation. This was framed as an unsupervised task, in which the only provided training data was a sentence-aligned subset of the Europarl parallel corpus (Koehn, 2005). (Mikolov et al., 2013a,b) present new syntactic and semantic relation data sets composed of analogous word pairs. |
| Dataset Splits | Yes | We trained using 80% of the word types chosen randomly and then tested on the remaining 20% of types. This procedure was repeated 10 times. The CoNLL 03 and the CoNLL 00 data sets had 204K/51K/46K and 212K/–/47K tokens respectively for the Train/Dev./Test sets. So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF as the supervised classifier. The SemEval 2010 trial data was used to select appropriate regularization parameters, and the SemEval 2010 test data was used for the final evaluations. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | Tokenization was performed using the NLTK tokenizer (Bird and Loper, 2004). The MEGA Model Optimization Package (MegaM) (Daumé III, 2004) and its NLTK interface (Bird et al., 2009) were used for training the models and producing output for the test sentences. We also used their BILOU text chunk representation and fast greedy inference, as it was shown to give superior performance. We also benchmark the performance of eigenwords on the MUC7 out-of-domain dataset, which had 59K words. Since the CoNLL 00 chunking data does not have a development set, we randomly sampled 1000 sentences from the training data (8936 sentences) for development. So, we trained our chunking models on 7936 training sentences and evaluated their F1 score on the 1000 development sentences and used a CRF as the supervised classifier. |
| Experiment Setup | Yes | Unless otherwise stated, we consider a fixed window of two words (h=2) on either side of a given word and a vocabulary of the 100,000 most frequent words for all the algorithms, in order to ensure fairness of comparison. Eigenword algorithms are robust to the dimensionality of the hidden space (k), so we did not tune it and fixed it at 200. For other algorithms, we report results using their best hidden space dimensionality. So, we took the square root of the word counts in the context matrices (i.e., W and C) before running OSCCA, TSCCA and LR-MVL(I). This squishes the word distributions and makes them look more normal (Gaussian). We ran LR-MVL(I) and LR-MVL(II) for 5 iterations and only used one exponential smooth of 0.5 for LR-MVL(II). Following (Ratinov and Roth, 2009), we use a regularized averaged perceptron model with the above set of baseline features for the NER task. We also used their BILOU text chunk representation and fast greedy inference, as it was shown to give superior performance. We tuned the magnitude of the ℓ2 regularization penalty in the CRF on the development set. The regularization penalty that gave the best performance on the development set was 2. First, regularization was introduced in the form of a Gaussian prior by setting the sigma parameter in NLTK's MegaM interface to a nonzero value. Second, always-on features were enabled, allowing the classifier to explicitly model the prior probabilities of each output label. |
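The experiment-setup row describes the core preprocessing behind the scores: square-root the word-context counts, then run a CCA-style spectral decomposition to get the embeddings. The following is a minimal sketch of that idea, not the paper's exact OSCCA/TSCCA algorithm; the function name `eigenword_embeddings` is invented, and the diagonal count scaling is only an approximation of full CCA whitening.

```python
import numpy as np

def eigenword_embeddings(counts, k):
    """Sketch of CCA-style spectral embeddings from a word-by-context count matrix.

    counts: dense (n_words, n_contexts) array of co-occurrence counts.
    k: embedding dimensionality (the paper fixes k=200).
    """
    # Square-root transform, as in the paper, to make count distributions more Gaussian.
    X = np.sqrt(counts)
    # Approximate CCA whitening with inverse-sqrt marginal scaling
    # (a simplification; full CCA whitens with the covariance matrices).
    row = X.sum(axis=1, keepdims=True)
    col = X.sum(axis=0, keepdims=True)
    Xs = X / np.sqrt(row + 1e-12) / np.sqrt(col + 1e-12)
    # Truncated SVD; the scaled left singular vectors serve as word embeddings.
    U, s, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :k] * s[:k]

# Usage on a toy random count matrix.
rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(50, 30)).astype(float)
E = eigenword_embeddings(counts, 10)  # one 10-dimensional vector per word
```

At corpus scale the paper works with a 100,000-word vocabulary and sparse context matrices, so the dense SVD here would be replaced by the randomized SVD of its Algorithm 5.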
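Algorithm 5 in the pseudocode row is a randomized singular value decomposition, which is what makes the spectral decomposition tractable on a 100,000-word vocabulary. Below is a self-contained sketch in the usual randomized range-finder style (Halko et al.); the parameter names and defaults are illustrative and not taken from the paper.

```python
import numpy as np

def randomized_svd(A, k, n_oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of A via random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random test matrix; oversampling a few extra columns improves accuracy.
    Omega = rng.standard_normal((n, k + n_oversample))
    Y = A @ Omega
    # Power iterations sharpen the captured subspace when singular values decay slowly.
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    # Orthonormal basis for the (approximate) range of A.
    Q, _ = np.linalg.qr(Y)
    # Project to the small subspace and run an exact SVD there.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]

# Usage: on an exactly rank-5 matrix the approximation is essentially exact.
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
U, s, Vt = randomized_svd(A, 5)
```

The cost is dominated by a handful of passes over `A`, which is why this variant scales to the sparse word-context matrices used here, whereas a dense full SVD would not.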