On the Downstream Performance of Compressed Word Embeddings

Authors: Avner May, Jian Zhang, Tri Dao, Christopher Ré

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our theoretical contributions and the efficacy of our proposed selection criterion by showing three main experimental results: First, we show the eigenspace overlap score is more predictive of downstream performance than existing measures of compression quality [40, 3, 41]. Second, we show uniform quantization consistently matches or outperforms all the compression methods to which we compare [2, 33, 15], in terms of both the eigenspace overlap score and downstream performance. Third, we show the eigenspace overlap score is a more accurate criterion for choosing between compressed embeddings than existing measures; specifically, we show that when choosing between embeddings drawn from a representative set we compressed [2, 33, 11, 15], the eigenspace overlap score is able to identify the one that attains better downstream performance with up to 2× lower selection error rates than the next best measure of compression quality. We consider several baseline measures of compression quality: the Pairwise Inner Product (PIP) loss [40], and two spectral measures of approximation error between the embedding Gram matrices [3, 41]. Our results are consistent across a range of NLP tasks [32, 18, 37], embedding types [28, 23, 10], and compression methods [2, 33, 11]. (A minimal sketch of computing the eigenspace overlap score is given after the table.)
Researcher Affiliation | Academia | Avner May, Jian Zhang, Tri Dao, Christopher Ré; Department of Computer Science, Stanford University; {avnermay, zjian, trid, chrismre}@cs.stanford.edu
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We provide a memory-efficient implementation of the uniform quantization method in https://github.com/HazyResearch/smallfry. (A minimal sketch of the general uniform-quantization technique is given after the table.)
Open Datasets | Yes | We evaluate compressed versions of publicly available 300-dimensional fastText and GloVe embeddings on question answering and sentiment analysis tasks, and compressed 768-dimensional WordPiece embeddings from the pre-trained case-sensitive BERT-Base model [10] on tasks from the General Language Understanding Evaluation (GLUE) benchmark [37]. We use the four compression methods discussed in Section 2: DCCL, k-means, dimensionality reduction, and uniform quantization. For the tasks, we consider question answering using the DrQA model [5] on the Stanford Question Answering Dataset (SQuAD) [32], sentiment analysis using a CNN model [18] on all the datasets used by Kim [18], and language understanding using the BERT-Base model on the tasks in the GLUE benchmark [37]. (A minimal k-means compression sketch is given after the table.)
Dataset Splits | No | The paper mentions training, but it does not explicitly state the specific training/validation/test splits (e.g., percentages or sample counts) used for the datasets, which would be needed to reproduce the data partitioning directly. It refers to models trained in a "standard manner" but lacks explicit details on splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only mentions general concepts like "memory-constrained settings" in the context of the problem.
Software Dependencies | No | The paper mentions using PyTorch [26] and Scikit-learn [27] in its references, but it does not specify the version numbers for these or any other software dependencies, which is necessary for reproducible setup.
Experiment Setup | Yes | For more details on the various embeddings, tasks, and hyperparameters we use, see Appendix D.
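
The eigenspace overlap score referenced above compares the left singular vectors of an uncompressed embedding matrix with those of its compressed counterpart. Below is a minimal NumPy sketch, not the authors' code, assuming the score is ||U^T Ũ||_F^2 / max(k, k̃), where U (n x k) and Ũ (n x k̃) hold the left singular vectors of the two matrices; consult the paper for the exact normalization. The function name and the example data are illustrative.

    import numpy as np

    def eigenspace_overlap_score(X, X_tilde):
        # Left singular vectors of each embedding matrix (thin SVD).
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
        k, k_t = U.shape[1], U_t.shape[1]
        # ||U^T U_tilde||_F^2, normalized by the larger number of singular vectors.
        overlap = np.linalg.norm(U.T @ U_t, ord="fro") ** 2
        return overlap / max(k, k_t)

    # Example: a random stand-in embedding matrix and a rank-100 dimensionality
    # reduction of it (the reduced matrix keeps the top 100 singular directions).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 300))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_reduced = U[:, :100] * s[:100]               # n x 100 reduced embeddings
    print(eigenspace_overlap_score(X, X_reduced))  # roughly 100 / 300

In this toy case the reduced embeddings span exactly 100 of the original 300 left singular directions, so the score comes out near one third.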
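The uniform quantization baseline clips each embedding entry to a symmetric range and rounds it to one of 2^b equally spaced values. The sketch below illustrates that general technique only; it is not the smallfry implementation, and the function name, the default clipping rule, and the parameters are illustrative assumptions (the paper tunes the clipping range rather than using the maximum absolute entry).

    import numpy as np

    def uniform_quantize(X, num_bits=2, clip=None):
        # Clip entries to [-clip, clip]; a simple default is the largest absolute entry.
        if clip is None:
            clip = np.abs(X).max()
        levels = 2 ** num_bits
        step = 2 * clip / (levels - 1)       # spacing between adjacent quantization levels
        X_clipped = np.clip(X, -clip, clip)
        # Integer codes in [0, levels - 1]; these are what would be stored in memory.
        codes = np.round((X_clipped + clip) / step).astype(np.int64)
        X_quantized = codes * step - clip    # dequantized values used downstream
        return X_quantized, codes

    # Example: compress 300-dimensional embeddings to 2 bits per entry.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10000, 300)) * 0.1
    X_q, codes = uniform_quantize(X, num_bits=2)
    print(codes.min(), codes.max())          # 0 and 3 for 2-bit quantization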
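For the k-means baseline, one common variant quantizes the scalar entries of the embedding matrix with k-means, replacing each entry by its nearest centroid. The sketch below follows that variant using scikit-learn (which the paper cites); the paper's exact k-means baseline may differ, and kmeans_quantize, the subsampling step, and the parameters are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_quantize(X, num_bits=2, sample_size=100000, seed=0):
        rng = np.random.default_rng(seed)
        flat = X.reshape(-1, 1)
        # Fit centroids on a subsample of entries to keep k-means fast.
        idx = rng.choice(flat.shape[0], size=min(sample_size, flat.shape[0]), replace=False)
        km = KMeans(n_clusters=2 ** num_bits, n_init=10, random_state=seed).fit(flat[idx])
        codes = km.predict(flat)                                   # one code per matrix entry
        X_quantized = km.cluster_centers_[codes].reshape(X.shape)  # nearest-centroid values
        return X_quantized, codes.reshape(X.shape)

    # Example: 2-bit (4-centroid) compression of random embeddings.
    X = np.random.default_rng(0).standard_normal((2000, 300)) * 0.1
    X_q, codes = kmeans_quantize(X, num_bits=2)
    print(np.unique(codes).size)             # 4 distinct codes for 2-bit compression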