ESPACE: Dimensionality Reduction of Activations for Model Compression
Authors: Charbel Sakr, Brucek Khailany
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we report on experimental studies investigating LLM compression using ESPACE. Accuracy is evaluated in two ways: perplexity measured on the Wikitext-103 dataset [36] and zero-shot downstream task accuracy on BoolQ (BQ) [37], HellaSwag (HS) [38], PIQA (PQ) [39], RACE (RA) [40], and WinoGrande (WG) [41]. |
| Researcher Affiliation | Industry | Charbel Sakr NVIDIA Research csakr@nvidia.com Brucek Khailany NVIDIA Research bkhailany@nvidia.com |
| Pseudocode | No | The paper does not contain any clearly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | No | As such, we believe the description of the work in the paper is sufficient for reproducibility; yet, we are happy to consider open sourcing our code in the future. |
| Open Datasets | Yes | Accuracy is evaluated in two ways: perplexity measured on the Wikitext-103 dataset [36] and zero-shot downstream task accuracy on BoolQ (BQ) [37], HellaSwag (HS) [38], PIQA (PQ) [39], RACE (RA) [40], and WinoGrande (WG) [41]. ... Retraining simply extends the models' pre-training sessions and uses the 330B-token MTNLG dataset [43], which was used to train GPT3 models. |
| Dataset Splits | Yes | The Wikitext-103 dataset is split into train, validation, and test sets. We use 512 random sequences from the training set for calibrating projection matrices required by ESPACE. We use the validation set for layer-wise sensitivity studies. |
| Hardware Specification | Yes | We measure using an NVIDIA A100 GPU and a simple, un-optimized implementation (see Appendix B.4). |
| Software Dependencies | No | Our implementation is built on top of Megatron-LM [33], which itself is based on the PyTorch framework. ... We then use the CuPy library in RAPIDS to perform fast (a few milliseconds per auto-correlation matrix) eigenvalue decomposition on GPUs. (Specific version numbers for these software components are not provided; a hedged sketch of this eigendecomposition step appears after the table.) |
| Experiment Setup | Yes | For GPT3-1.3B, the initial learning rate is set to 1.0 × 10⁻⁴, the final learning rate is set to 1.0 × 10⁻⁵, and the global batch size is set to 512. (Similar details are provided for other models in Appendix B.3.) |
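The calibration step described above (512 Wikitext-103 training sequences, per-layer activation auto-correlation matrices, GPU eigenvalue decomposition via CuPy) can be sketched roughly as follows. This is a minimal illustration assembled only from the details quoted in the table, not the authors' code: the `calibrate_projection` helper, the toy tensor shapes, and the choice of retained rank `k` are illustrative assumptions.

```python
import numpy as np
import cupy as cp  # GPU arrays and linear algebra, as referenced in the paper

def calibrate_projection(activations: np.ndarray, k: int) -> np.ndarray:
    """Hypothetical helper: derive a rank-k activation projection for one layer.

    activations: (num_tokens, hidden_dim) matrix gathered from the
    calibration sequences. Returns a (hidden_dim, k) matrix whose columns
    are the top eigenvectors of the activation auto-correlation matrix.
    """
    x = cp.asarray(activations, dtype=cp.float64)
    # Auto-correlation matrix of the activations: (hidden_dim, hidden_dim).
    corr = (x.T @ x) / x.shape[0]
    # Eigenvalue decomposition on the GPU; eigh applies since corr is symmetric.
    eigvals, eigvecs = cp.linalg.eigh(corr)
    # eigh returns eigenvalues in ascending order; keep the k largest components.
    top_k = eigvecs[:, -k:][:, ::-1]
    return cp.asnumpy(top_k)

# Toy usage with made-up sizes (a real run would hook per-layer activations
# from the 512 Wikitext-103 calibration sequences inside Megatron-LM).
acts = np.random.randn(4096, 1024).astype(np.float32)
P = calibrate_projection(acts, k=256)
print(P.shape)  # (1024, 256)
```

Using `eigh` rather than a general eigensolver is the natural choice here, since an auto-correlation matrix is symmetric positive semi-definite; that structure is also what makes each per-matrix decomposition fast, consistent with the "few milliseconds per auto-correlation matrix" quoted above.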