On the Downstream Performance of Compressed Word Embeddings
Authors: Avner May, Jian Zhang, Tri Dao, Christopher Ré
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our theoretical contributions and the efficacy of our proposed selection criterion by showing three main experimental results: First, we show the eigenspace overlap score is more predictive of downstream performance than existing measures of compression quality [40, 3, 41]. Second, we show uniform quantization consistently matches or outperforms all the compression methods to which we compare [2, 33, 15], in terms of both the eigenspace overlap score and downstream performance. Third, we show the eigenspace overlap score is a more accurate criterion for choosing between compressed embeddings than existing measures; specifically, we show that when choosing between embeddings drawn from a representative set we compressed [2, 33, 11, 15], the eigenspace overlap score is able to identify the one that attains better downstream performance with up to 2× lower selection error rates than the next best measure of compression quality. We consider several baseline measures of compression quality: the Pairwise Inner Product (PIP) loss [40], and two spectral measures of approximation error between the embedding Gram matrices [3, 41]. (The eigenspace overlap score and the PIP loss are sketched in code after the table.) Our results are consistent across a range of NLP tasks [32, 18, 37], embedding types [28, 23, 10], and compression methods [2, 33, 11]. |
| Researcher Affiliation | Academia | Avner May, Jian Zhang, Tri Dao, Christopher Ré; Department of Computer Science, Stanford University; {avnermay, zjian, trid, chrismre}@cs.stanford.edu |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide a memory-efficient implementation of the uniform quantization method in https://github.com/HazyResearch/smallfry. |
| Open Datasets | Yes | We evaluate compressed versions of publicly available 300-dimensional fastText and GloVe embeddings on question answering and sentiment analysis tasks, and compressed 768-dimensional WordPiece embeddings from the pre-trained case-sensitive BERT-Base model [10] on tasks from the General Language Understanding Evaluation (GLUE) benchmark [37]. We use the four compression methods discussed in Section 2: DCCL, k-means, dimensionality reduction, and uniform quantization (sketched in code after the table). For the tasks, we consider question answering using the DrQA model [5] on the Stanford Question Answering Dataset (SQuAD) [32], sentiment analysis using a CNN model [18] on all the datasets used by Kim [18], and language understanding using the BERT-Base model on the tasks in the GLUE benchmark [37]. |
| Dataset Splits | No | The paper mentions training, but it does not explicitly state the specific training/validation/test splits (e.g., percentages or sample counts) used for the datasets to allow for direct reproducibility of data partitioning. It refers to models trained in a "standard manner" but lacks explicit details on splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory) used to run the experiments. It only mentions general concepts like "memory-constrained settings" in the context of the problem. |
| Software Dependencies | No | The paper mentions using PyTorch [26] and Scikit-learn [27] in its references, but it does not specify the version numbers for these or any other software dependencies, which is necessary for reproducible setup. |
| Experiment Setup | Yes | For more details on the various embeddings, tasks, and hyperparameters we use, see Appendix D. |
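
The Research Type row names the eigenspace overlap score and the Pairwise Inner Product (PIP) loss without defining them. As a reading aid, here is a minimal NumPy sketch of both measures based on the paper's high-level description; the exact normalization of the overlap score (by the larger of the two embedding dimensions) and the function names are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def eigenspace_overlap(X, X_tilde):
    """Eigenspace overlap score between an embedding matrix X (n x d)
    and a compressed version X_tilde (n x d').

    U and U_t hold the orthonormal left singular vectors of each matrix;
    the score is the squared Frobenius norm of U^T U_t, normalized here
    by max(d, d') so that a perfect overlap scores 1.0 (the normalization
    is an assumption of this sketch).
    """
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_t, _, _ = np.linalg.svd(X_tilde, full_matrices=False)
    k = max(U.shape[1], U_t.shape[1])
    return np.linalg.norm(U.T @ U_t, ord="fro") ** 2 / k

def pip_loss(X, X_tilde):
    """PIP loss [40]: the Frobenius distance between the Gram (pairwise
    inner product) matrices of the original and compressed embeddings."""
    return np.linalg.norm(X @ X.T - X_tilde @ X_tilde.T, ord="fro")

# Example: a random embedding matrix vs. a noisy "compressed" copy.
X = np.random.randn(1000, 300)
X_noisy = X + 0.1 * np.random.randn(1000, 300)
print(eigenspace_overlap(X, X_noisy), pip_loss(X, X_noisy))
```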
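
The Open Source Code row points to the authors' memory-efficient uniform-quantization implementation; the sketch below illustrates plain uniform quantization with deterministic rounding to 2^b evenly spaced levels over the observed value range. The choice of clipping range and rounding scheme are assumptions of this sketch; the authors' actual implementation lives at https://github.com/HazyResearch/smallfry and may differ.

```python
import numpy as np

def uniform_quantize(X, num_bits=4):
    """Uniformly quantize X to 2**num_bits evenly spaced levels spanning
    [X.min(), X.max()], then reconstruct the (lossy) embeddings.

    A sketch only: the clipping range and deterministic rounding are
    assumptions; the authors' implementation may differ (e.g., stochastic
    rounding or a tuned clipping range).
    """
    lo, hi = float(X.min()), float(X.max())
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    codes = np.round((X - lo) / scale).astype(np.int32)  # b-bit integer codes
    return lo + codes.astype(X.dtype) * scale

# Example: compress 300-dimensional embeddings to 4 bits per entry.
X = np.random.randn(1000, 300).astype(np.float32)
X_q = uniform_quantize(X, num_bits=4)
print(np.abs(X - X_q).max())  # reconstruction error is bounded by scale / 2
```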