On the Downstream Performance of Compressed Word Embeddings

Authors: Avner May, Jian Zhang, Tri Dao, Christopher Ré

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically validate our theoretical contributions and the efficacy of our proposed selection criterion by showing three main experimental results: First, we show the eigenspace overlap score is more predictive of downstream performance than existing measures of compression quality [40, 3, 41]. Second, we show uniform quantization consistently matches or outperforms all the compression methods to which we compare [2, 33, 15], in terms of both the eigenspace overlap score and downstream performance. Third, we show the eigenspace overlap score is a more accurate criterion for choosing between compressed embeddings than existing measures; specifically, we show that when choosing between embeddings drawn from a representative set we compressed [2, 33, 11, 15], the eigenspace overlap score is able to identify the one that attains better downstream performance with up to 2× lower selection error rates than the next best measure of compression quality. We consider several baseline measures of compression quality: the Pairwise Inner Product (PIP) loss [40], and two spectral measures of approximation error between the embedding Gram matrices [3, 41]. Our results are consistent across a range of NLP tasks [32, 18, 37], embedding types [28, 23, 10], and compression methods [2, 33, 11]. (A minimal sketch of computing the eigenspace overlap score is given after the table.)
Researcher Affiliation | Academia | Avner May, Jian Zhang, Tri Dao, Christopher Ré; Department of Computer Science, Stanford University; {avnermay, zjian, trid, chrismre}@cs.stanford.edu
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We provide a memory-efficient implementation of the uniform quantization method in https://github.com/HazyResearch/smallfry. (A minimal sketch of the general uniform-quantization technique is given after the table.)
Open Datasets | Yes | We evaluate compressed versions of publicly available 300-dimensional fastText and GloVe embeddings on question answering and sentiment analysis tasks, and compressed 768-dimensional WordPiece embeddings from the pre-trained case-sensitive BERT-Base model [10] on tasks from the General Language Understanding Evaluation (GLUE) benchmark [37]. We use the four compression methods discussed in Section 2: DCCL, k-means, dimensionality reduction, and uniform quantization. For the tasks, we consider question answering using the DrQA model [5] on the Stanford Question Answering Dataset (SQuAD) [32], sentiment analysis using a CNN model [18] on all the datasets used by Kim [18], and language understanding using the BERT-Base model on the tasks in the GLUE benchmark [37]. (A minimal k-means compression sketch is given after the table.)
Dataset Splits | No | The paper mentions training, but it does not explicitly state the specific training/validation/test splits (e.g., percentages or sample counts) used for the datasets, which would be needed to reproduce the data partitioning directly. It refers to models trained in a "standard manner" but lacks explicit details on splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only mentions general concepts like "memory-constrained settings" in the context of the problem.
Software Dependencies | No | The paper mentions using PyTorch [26] and Scikit-learn [27] in its references, but it does not specify the version numbers for these or any other software dependencies, which is necessary for reproducible setup.
Experiment Setup | Yes | For more details on the various embeddings, tasks, and hyperparameters we use, see Appendix D.
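
The eigenspace overlap score referenced above compares the left singular vectors of an uncompressed embedding matrix with those of its compressed counterpart. Below is a minimal NumPy sketch, not the authors' code, assuming the score is ||U^T Ũ||_F^2 / max(k, k̃), where U (n x k) and Ũ (n x k̃) hold the left singular vectors of the two matrices; consult the paper for the exact normalization. The function name and the example data are illustrative.

    import numpy as np

    def eigenspace_overlap_score(X, X_tilde):
        # Left singular vectors of each embedding matrix (thin SVD).
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
        k, k_t = U.shape[1], U_t.shape[1]
        # ||U^T U_tilde||_F^2, normalized by the larger number of singular vectors.
        overlap = np.linalg.norm(U.T @ U_t, ord="fro") ** 2
        return overlap / max(k, k_t)

    # Example: a random stand-in embedding matrix and a rank-100 dimensionality
    # reduction of it (the reduced matrix keeps the top 100 singular directions).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 300))
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_reduced = U[:, :100] * s[:100]               # n x 100 reduced embeddings
    print(eigenspace_overlap_score(X, X_reduced))  # roughly 100 / 300

In this toy case the reduced embeddings span exactly 100 of the original 300 left singular directions, so the score comes out near one third.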
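The uniform quantization baseline clips each embedding entry to a symmetric range and rounds it to one of 2^b equally spaced values. The sketch below illustrates that general technique only; it is not the smallfry implementation, and the function name, the default clipping rule, and the parameters are illustrative assumptions (the paper tunes the clipping range rather than using the maximum absolute entry).

    import numpy as np

    def uniform_quantize(X, num_bits=2, clip=None):
        # Clip entries to [-clip, clip]; a simple default is the largest absolute entry.
        if clip is None:
            clip = np.abs(X).max()
        levels = 2 ** num_bits
        step = 2 * clip / (levels - 1)       # spacing between adjacent quantization levels
        X_clipped = np.clip(X, -clip, clip)
        # Integer codes in [0, levels - 1]; these are what would be stored in memory.
        codes = np.round((X_clipped + clip) / step).astype(np.int64)
        X_quantized = codes * step - clip    # dequantized values used downstream
        return X_quantized, codes

    # Example: compress 300-dimensional embeddings to 2 bits per entry.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10000, 300)) * 0.1
    X_q, codes = uniform_quantize(X, num_bits=2)
    print(codes.min(), codes.max())          # 0 and 3 for 2-bit quantization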
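For the k-means baseline, one common variant quantizes the scalar entries of the embedding matrix with k-means, replacing each entry by its nearest centroid. The sketch below follows that variant using scikit-learn (which the paper cites); the paper's exact k-means baseline may differ, and kmeans_quantize, the subsampling step, and the parameters are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_quantize(X, num_bits=2, sample_size=100000, seed=0):
        rng = np.random.default_rng(seed)
        flat = X.reshape(-1, 1)
        # Fit centroids on a subsample of entries to keep k-means fast.
        idx = rng.choice(flat.shape[0], size=min(sample_size, flat.shape[0]), replace=False)
        km = KMeans(n_clusters=2 ** num_bits, n_init=10, random_state=seed).fit(flat[idx])
        codes = km.predict(flat)                                   # one code per matrix entry
        X_quantized = km.cluster_centers_[codes].reshape(X.shape)  # nearest-centroid values
        return X_quantized, codes.reshape(X.shape)

    # Example: 2-bit (4-centroid) compression of random embeddings.
    X = np.random.default_rng(0).standard_normal((2000, 300)) * 0.1
    X_q, codes = kmeans_quantize(X, num_bits=2)
    print(np.unique(codes).size)             # 4 distinct codes for 2-bit compression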