XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient
Authors: Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform a very comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. |
| Researcher Affiliation | Industry | Microsoft {xiaoxiawu, zheweiyao, minjiaz, conglong.li, yuxhe}@microsoft.com |
| Pseudocode | No | The paper describes its methods in prose and figures, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is released as a part of https://github.com/microsoft/DeepSpeed |
| Open Datasets | Yes | All these evaluations are performed with the General Language Understanding Evaluation (GLUE) benchmark [51] |
| Dataset Splits | Yes | We report results on the development sets after compressing a pre-trained model (e.g., BERTbase and Tiny BERT) using the corresponding single-task training data. |
| Hardware Specification | No | The reproducibility checklist answers 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?' with '[Yes] Those are in main text,' but the main text does not specify GPU/CPU models or explicit cloud instance types. |
| Software Dependencies | No | The paper does not provide specific software dependency versions (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) in the main text. |
| Experiment Setup | Yes | We consider three budgets listed in Table 1, which cover the practical scenarios of short, standard, and long training time... Meanwhile, we also perform a grid search of peak learning rates {2e-5, 1e-4, 5e-4}. For more training details on iterations and batch size per iteration, please see Table C.1. |
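
The experiment setup row above quotes a per-task grid search over three peak learning rates on GLUE development sets. The sketch below is a minimal illustration of that setup, not the authors' released code: it assumes the Hugging Face `datasets` package for loading a single GLUE task's train/validation split, and `train_and_eval` is a hypothetical placeholder for the compression-aware fine-tuning and dev-set evaluation step.

```python
# Minimal sketch of the per-task setup described above (not the authors' code).
# Assumes the Hugging Face `datasets` package is installed.
from datasets import load_dataset


def train_and_eval(train_split, dev_split, lr):
    """Hypothetical stand-in for compression-aware fine-tuning plus dev evaluation."""
    return 0.0  # replace with the actual training/evaluation logic


task = "mrpc"  # any single GLUE task
glue = load_dataset("glue", task)
train_split, dev_split = glue["train"], glue["validation"]

# Grid search over the peak learning rates quoted from the paper.
peak_learning_rates = [2e-5, 1e-4, 5e-4]
results = {lr: train_and_eval(train_split, dev_split, lr) for lr in peak_learning_rates}
best_lr = max(results, key=results.get)
print(f"Best peak learning rate for {task}: {best_lr}")
```

In the paper this loop would additionally be repeated for each of the three training-time budgets in its Table 1; the budget-specific iteration counts and batch sizes it refers to (Table C.1) are not reproduced here.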