Differentially Private Model Compression
Authors: FatemehSadat Mireshghallah, Arturs Backurs, Huseyin A. Inan, Lukas Wutschitz, Janardhan Kulkarni
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluation of these ideas on standard GLUE benchmarks using BERT models shows that the DPKD approach to model compression loses about 5% accuracy compared to the larger models when the compressed model is half the size of the full BERT model. |
| Researcher Affiliation | Collaboration | 1 University of California San Diego, 2 Microsoft Research, 3 Microsoft |
| Pseudocode | Yes | Algorithm 1 Differentially Private Knowledge Distillation (DPKD), Algorithm 2 Structured DPIMP, Algorithm 3 Unstructured DPIMP (illustrative sketches of both ideas follow the table) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | Following prior work [77, 34, 75], we experiment with the following set of 4 tasks from the GLUE benchmark [68]: MNLI (Multi-Genre Natural Language Inference Corpus), QQP (Quora Question Pairs), QNLI (Stanford Question Answering Dataset) and SST-2 (Stanford Sentiment Treebank). |
| Dataset Splits | Yes | Following prior work [77, 34, 75], we experiment with the following set of 4 tasks from the GLUE benchmark [68]: MNLI (Multi-Genre Natural Language Inference Corpus), QQP (Quora Question Pairs), QNLI (Stanford Question Answering Dataset) and SST-2 (Stanford Sentiment Treebank). |
| Hardware Specification | Yes | All our experiments were conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The code is written in Python using PyTorch and Huggingface libraries. The paper mentions software names but does not provide specific version numbers for reproducibility. |
| Experiment Setup | Yes | We describe the software and hardware specifications in Appendix A.1 and hyperparameter settings in Appendix A.2. The number of epochs for fine-tuning is 10, except for the QQP and SST-2 datasets, where it is 5. We use a batch size of 256 for all experiments. The learning rate for DPSGD fine-tuning is 5e-5. We use a noise multiplier of 0.1 for all experiments and a max grad norm of 1.0. (A hedged training sketch using these settings appears below the table.) |
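
To make the reported experiment setup concrete, below is a minimal, hedged sketch of differentially private fine-tuning of a smaller "student" model with a distillation loss, in the spirit of the DPKD algorithm named in the Pseudocode row. It plugs in the hyperparameters quoted above (batch size 256, learning rate 5e-5, noise multiplier 0.1, max grad norm 1.0, 5 epochs). The models and data are toy stand-ins rather than BERT on GLUE, Opacus is only one possible DP-SGD implementation (the paper's own code may differ), and the temperature `T` and mixing weight `alpha` are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-ins: a frozen teacher and a smaller student to be compressed.
# In the paper these would be BERT variants fine-tuned on GLUE tasks.
teacher = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Synthetic "private" dataset standing in for a GLUE task.
X = torch.randn(1024, 128)
y = torch.randint(0, 2, (1024,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=256)  # batch size 256, as reported

optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)  # learning rate 5e-5, as reported

# Wrap model/optimizer/loader so each step clips per-sample gradients and adds noise.
privacy_engine = PrivacyEngine(accountant="rdp")
student, optimizer, train_loader = privacy_engine.make_private(
    module=student,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=0.1,  # as reported in the experiment setup
    max_grad_norm=1.0,     # per-sample clipping norm, as reported
)

T, alpha = 2.0, 0.5  # distillation temperature and mixing weight (assumed values)

for epoch in range(5):  # 5 epochs reported for QQP and SST-2, 10 for the other tasks
    for xb, yb in train_loader:
        optimizer.zero_grad()
        with torch.no_grad():
            teacher_logits = teacher(xb)
        student_logits = student(xb)
        # Distillation loss: soft teacher targets (KL) mixed with hard labels (CE).
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        ce = F.cross_entropy(student_logits, yb)
        loss = alpha * kd + (1 - alpha) * ce
        loss.backward()
        optimizer.step()  # DP-SGD step: clipped, noised gradient update

print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```

The DP-SGD mechanics (per-sample gradient clipping to `max_grad_norm` and Gaussian noise scaled by `noise_multiplier`) are handled inside the wrapped optimizer; the distillation loss itself is an ordinary soft-target KL term plus cross-entropy.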
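
The Pseudocode row also lists structured and unstructured DPIMP (differentially private iterative magnitude pruning). The snippet below only illustrates the generic unstructured magnitude-pruning mechanic with `torch.nn.utils.prune`; the 20% pruning rate and the toy model are assumptions, and the paper's Algorithms 2-3 specify the actual schedule and the structured variant.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a compressed BERT student.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

# Globally zero out the 20% of weights with the smallest magnitude
# (the 20% rate is an assumed value, not the paper's schedule).
params_to_prune = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
prune.global_unstructured(
    params_to_prune, pruning_method=prune.L1Unstructured, amount=0.2
)

# In iterative magnitude pruning, a few epochs of (DP-SGD) fine-tuning would
# follow each pruning step before pruning again. Selecting weights purely by
# the magnitudes of an already DP-trained model is post-processing and does
# not, by itself, consume additional privacy budget.
zeros = sum(int((m.weight == 0).sum()) for m, _ in params_to_prune)
total = sum(m.weight.numel() for m, _ in params_to_prune)
print(f"global sparsity: {zeros / total:.0%}")
```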