Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Differentially Private Model Compression
Authors: FatemehSadat Mireshghallah, Arturs Backurs, Huseyin A. Inan, Lukas Wutschitz, Janardhan Kulkarni
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical evaluation of these ideas on standard GLUE benchmarks using BERT models show that DPKD approach to the model compression loses an accuracy of 5% compared to the larger models if the compressed model has half the size of the full BERT model. |
| Researcher Affiliation | Collaboration | 1 University of California San Diego, 2 Microsoft Research, 3 Microsoft |
| Pseudocode | Yes | Algorithm 1 Differentially Private Knowledge Distillation (DPKD), Algorithm 2 Structured DPIMP, Algorithm 3 Unstructured DPIMP |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] |
| Open Datasets | Yes | Following prior work [77, 34, 75], we experiment with the following set of 4 tasks from the GLUE benchmark [68]: MNLI (Multi-Genre Natural Language Inference Corpus), QQP (Quora Question Pairs), QNLI (Stanford Question Answering Dataset) and SST-2 (Stanford Sentiment Treebank). |
| Dataset Splits | Yes | Following prior work [77, 34, 75], we experiment with the following set of 4 tasks from the GLUE benchmark [68]: MNLI (Multi-Genre Natural Language Inference Corpus), QQP (Quora Question Pairs), QNLI (Stanford Question Answering Dataset) and SST-2 (Stanford Sentiment Treebank). |
| Hardware Specification | Yes | All our experiments were conducted on NVIDIA A100 GPUs. |
| Software Dependencies | No | The code is written in Python using PyTorch and Huggingface libraries. The paper mentions software names but does not provide specific version numbers for reproducibility. |
| Experiment Setup | Yes | We describe the software and hardware specifications in Appendix A.1 and hyperparameter settings in Appendix A.2. The number of epochs for fine-tuning models are 10, except for QQP and SST-2 datasets, where it is 5. We use batch size of 256 for all experiments. The learning rate for DPSGD fine-tuning is 5e-5. We use noise multiplier of 0.1 for all experiments and max grad norm of 1.0. |
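
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, together with a toy DP-SGD gradient step that shows where the noise multiplier and max grad norm enter. This is a minimal sketch in plain Python, not the authors' code: the class name `DPFineTuneConfig`, its field names, and the helper functions are hypothetical, and only the numeric values come from the row above. Averaging over the number of gradients actually passed in is an assumption made for the illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DPFineTuneConfig:
    # Values quoted in the Experiment Setup row; field names are hypothetical.
    batch_size: int = 256
    learning_rate: float = 5e-5
    noise_multiplier: float = 0.1
    max_grad_norm: float = 1.0

    def epochs_for(self, task: str) -> int:
        # 10 fine-tuning epochs, except 5 for QQP and SST-2.
        return 5 if task.upper() in {"QQP", "SST-2"} else 10


def clip_to_norm(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_example_grads, cfg, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_to_norm(grad, cfg.max_grad_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # Noise standard deviation is noise_multiplier * max_grad_norm.
    sigma = cfg.noise_multiplier * cfg.max_grad_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    # Assumption: average over the gradients actually supplied.
    return [n / len(per_example_grads) for n in noisy]


def sgd_step(params, grad, cfg):
    """Plain SGD update using the configured learning rate."""
    return [p - cfg.learning_rate * g for p, g in zip(params, grad)]


if __name__ == "__main__":
    cfg = DPFineTuneConfig()
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
    update = dp_sgd_gradient(grads, cfg, rng)
    print(cfg.epochs_for("QQP"), sgd_step([0.0, 0.0], update, cfg))
```

In practice, fine-tuning of this kind would typically rely on a differential privacy library rather than hand-written clipping; the sketch only makes the roles of the reported settings concrete.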