Differentially Private Fine-tuning of Language Models
Authors: Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally evaluate our methods for DP fine-tuning to demonstrate their utility, privacy, and parameter-efficiency. |
| Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, 2 Microsoft Research Asia, 3 Microsoft Research, 4 Microsoft, 5 Cheriton School of Computer Science, University of Waterloo, 6 University of Washington |
| Pseudocode | No | The paper describes algorithms and methods in text and mathematical formulations but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/AnonymousAKES/Differentially-Private-Fine-tuning-of-Language-Models. |
| Open Datasets | Yes | We use RoBERTa models (Liu et al., 2019), which are pre-trained on public data collected from the web. We choose four downstream tasks: MNLI, QQP, QNLI, and SST-2 from GLUE (Wang et al., 2018), following Yu et al. (2021b). |
| Dataset Splits | Yes | The E2E dataset in Novikova et al. (2017) contains template-like information in the restaurant domain to be mapped to natural language with end-to-end training. The dataset consists of 42K training samples, 4.6K validation samples, and 4.6K test samples. |
| Hardware Specification | Yes | Table 2: Memory and speed comparison for RoBERTa-Large. ... The speed is measured by the wall-clock time for training one epoch of the SST-2 dataset on a single Tesla V100 GPU with gradient accumulation for batch size 2000. |
| Software Dependencies | No | The paper mentions optimizers like AdamW and privacy accounting methods like the PRV accountant of Gopi et al. (2021), but it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | Hyperparameter choice: Given the large number of hyperparameter choices, e.g., the intermediate representation dimension, learning rate, weight decay, privacy parameter δ, and model size, an exhaustive grid search over all hyperparameters is expensive. Our hyperparameter choices are informed by prior work and are as follows. For privacy parameters, we use δ = 1e-5 for SST-2 and QNLI and δ = 1e-6 for QQP and MNLI due to their dataset sizes, and use noise multipliers 0.92, 0.83, 0.66 and 0.65 for SST-2, QNLI, QQP, and MNLI, respectively... The clipping threshold is 10 for all methods. The batch size is 2000. ... We train for 20 epochs using AdamW (Loshchilov & Hutter, 2019) with weight decay 1e-2 and search over four learning rates {5e-4, 1e-3, 2e-3, 5e-3}. (A hedged configuration sketch follows the table.) |
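
The hyperparameters quoted in the Experiment Setup row map onto a standard DP-SGD training loop. Below is a minimal sketch wiring those values together, assuming PyTorch with the Opacus library; the library choice, toy model, and synthetic data loader are illustrative assumptions, since the paper does not name its framework, and the sketch shows plain DP-SGD rather than the paper's parameter-efficient fine-tuning methods.

```python
# Hedged sketch: a DP-SGD fine-tuning loop using the quoted hyperparameters
# (clipping threshold 10, batch size 2000, 20 epochs, AdamW with weight decay
# 1e-2, SST-2 noise multiplier 0.92, delta = 1e-5). Opacus, the toy classifier
# head, and the random dataset are assumptions, not the paper's actual setup.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for a classifier head being fine-tuned (illustrative only).
model = torch.nn.Linear(768, 2)

# Placeholder data; in the paper this would be a GLUE task such as SST-2.
dataset = TensorDataset(torch.randn(2000, 768), torch.randint(0, 2, (2000,)))
loader = DataLoader(dataset, batch_size=2000)

# AdamW with weight decay 1e-2; learning rate drawn from the searched grid.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Attach DP-SGD: per-sample gradient clipping at threshold 10 and Gaussian
# noise with the SST-2 noise multiplier 0.92, tracked with a PRV accountant.
privacy_engine = PrivacyEngine(accountant="prv")
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=0.92,
    max_grad_norm=10.0,
)

criterion = torch.nn.CrossEntropyLoss()
for epoch in range(20):  # 20 epochs, as in the quoted setup
    for features, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()

# Report the privacy budget spent at delta = 1e-5 (the SST-2 setting).
print(privacy_engine.get_epsilon(delta=1e-5))
```

In this sketch the learning rate 1e-3 is one point from the searched grid {5e-4, 1e-3, 2e-3, 5e-3}; in practice one run per grid point would be trained and the best selected.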