XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Authors: Xiaoxia Wu, Zhewei Yao, Minjia Zhang, Conglong Li, Yuxiong He

NeurIPS 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We perform a very comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works." |
| Researcher Affiliation | Industry | Microsoft: {xiaoxiawu, zheweiyao, minjiaz, conglong.li, yuxhe}@microsoft.com |
| Pseudocode | No | The paper describes its methods in prose and figures but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | "Code is released as part of https://github.com/microsoft/DeepSpeed" |
| Open Datasets | Yes | "All these evaluations are performed with the General Language Understanding Evaluation (GLUE) benchmark [51]" |
| Dataset Splits | Yes | "We report results on the development sets after compressing a pre-trained model (e.g., BERTbase and TinyBERT) using the corresponding single-task training data." |
| Hardware Specification | No | The reproducibility checklist answers "Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Those are in main text." However, specific hardware details such as GPU/CPU models or cloud instance types are not found in the main text. |
| Software Dependencies | No | The paper does not pin specific software dependency versions (e.g., Python 3.8, PyTorch 1.9, CUDA 11.1) in the main text. |
| Experiment Setup | Yes | "We consider three budgets listed in Table 1, which cover the practical scenarios of short, standard, and long training time... Meanwhile, we also perform a grid search of peak learning rates {2e-5, 1e-4, 5e-4}. For more training details on iterations and batch size per iteration, please see Table C.1." |
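For the Open Source Code row: the released XTC code lives in DeepSpeed's compression module. Below is a minimal sketch of how that module is typically invoked, following the entry points `init_compression` and `redundancy_clean` from the DeepSpeed Compression tutorial; the model choice and the config file name `ds_config.json` are placeholder assumptions, not details from the paper, so check the repository docs for the exact API.

```python
# Hedged sketch: wrapping a pre-trained model with DeepSpeed compression.
# `init_compression` / `redundancy_clean` follow DeepSpeed's compression
# tutorial; "ds_config.json" is an assumed config file name.
from transformers import AutoModelForSequenceClassification
from deepspeed.compression.compress import init_compression, redundancy_clean

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Inject compression (e.g., extreme quantization) per the JSON config.
model = init_compression(model, "ds_config.json")

# ... fine-tune / distill the compressed model on the task data here ...

# Fold the learned compression back into the weights for deployment.
model = redundancy_clean(model, "ds_config.json")
```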
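For the Open Datasets and Dataset Splits rows: the paper reports results on GLUE development sets after training on the corresponding single-task training data. A minimal loading sketch using the Hugging Face `datasets` library is shown below; this library is an assumed tooling choice, as the paper does not specify its data-loading stack.

```python
# Hedged sketch: fetching a GLUE task and its train / development splits.
from datasets import load_dataset

dataset = load_dataset("glue", "rte")  # any GLUE task: "sst2", "mnli", "qqp", ...

train_split = dataset["train"]       # single-task training data used during compression
dev_split = dataset["validation"]    # development set on which results are reported
                                     # (MNLI instead exposes validation_matched/mismatched)

print(len(train_split), len(dev_split))
```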
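For the Experiment Setup row: the sweep crosses three training budgets with a grid of peak learning rates {2e-5, 1e-4, 5e-4}. The sketch below illustrates that loop structure only; the budget iteration counts and the `finetune` helper are hypothetical stand-ins, since the real values live in the paper's Tables 1 and C.1.

```python
# Hedged sketch of the budget x learning-rate grid search.
from itertools import product

peak_learning_rates = [2e-5, 1e-4, 5e-4]  # grid reported in the paper

# Assumed iteration counts for illustration only; see Table 1 / Table C.1.
budgets = {"short": 1_000, "standard": 10_000, "long": 100_000}


def finetune(iterations: int, peak_lr: float) -> float:
    # Hypothetical placeholder: a real run would train the compressed model
    # for `iterations` steps at `peak_lr` and return its GLUE dev-set score.
    return 0.0


best = None
for (budget_name, iterations), lr in product(budgets.items(), peak_learning_rates):
    score = finetune(iterations=iterations, peak_lr=lr)
    if best is None or score > best[0]:
        best = (score, budget_name, lr)

print(f"best dev score {best[0]:.2f} with budget={best[1]}, peak_lr={best[2]}")
```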