Differentially Private Fine-tuning of Language Models

Authors: Da Yu, Saurabh Naik, Arturs Backurs, Sivakanth Gopi, Huseyin A. Inan, Gautam Kamath, Janardhan Kulkarni, Yin Tat Lee, Andre Manoel, Lukas Wutschitz, Sergey Yekhanin, Huishuai Zhang

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally evaluate our methods for DP fine-tuning to demonstrate their utility, privacy, and parameter-efficiency.
Researcher Affiliation | Collaboration | 1 Sun Yat-sen University, 2 Microsoft Research Asia, 3 Microsoft Research, 4 Microsoft, 5 Cheriton School of Computer Science, University of Waterloo, 6 University of Washington
Pseudocode | No | The paper describes algorithms and methods in text and mathematical formulations but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured steps.
Open Source Code | Yes | Our code is publicly available at https://github.com/AnonymousAKES/Differentially-Private-Fine-tuning-of-Language-Models.
Open Datasets | Yes | We use RoBERTa models (Liu et al., 2019), which are pre-trained on public data collected from the web. We choose four downstream tasks: MNLI, QQP, QNLI, and SST-2 from GLUE (Wang et al., 2018), following Yu et al. (2021b). (A data-loading sketch for these GLUE tasks appears after the table.)
Dataset Splits | Yes | The E2E dataset in Novikova et al. (2017) contains template-like information in the restaurant domain to be mapped to natural language with end-to-end training. The dataset consists of 42K training samples, 4.6K validation samples, and 4.6K test samples.
Hardware Specification | Yes | Table 2: Memory and speed comparison for RoBERTa-Large. ... The speed is measured by the wall-clock time for training one epoch of the SST-2 dataset on a single Tesla V100 GPU with gradient accumulation for batch size 2000.
Software Dependencies | No | The paper mentions optimizers like AdamW and privacy accounting methods like Gopi et al. (2021)'s PRV accountant, but it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. (An illustrative privacy-accounting sketch follows the table.)
Experiment Setup | Yes | Hyperparameter choice: Given the large number of hyperparameter choices, e.g., the intermediate representation dimension, learning rate, weight decay, privacy parameter δ, and model size, an exhaustive grid search over all hyperparameters is expensive. Our hyperparameter choices are informed by prior work and are as follows. For privacy parameters, we use δ = 1e-5 for SST-2 and QNLI and δ = 1e-6 for QQP and MNLI due to their dataset sizes, and use noise multipliers 0.92, 0.83, 0.66 and 0.65 for SST-2, QNLI, QQP, and MNLI, respectively... The clipping threshold is 10 for all methods. The batch size is 2000. ... We train for 20 epochs using AdamW (Loshchilov & Hutter, 2019) with weight decay 1e-2 and search over four learning rates {5e-4, 1e-3, 2e-3, 5e-3}. (A minimal training sketch using these values follows the table.)
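
The four GLUE tasks quoted in the Open Datasets row are publicly available. The snippet below is a minimal, illustrative sketch (not part of the paper) that loads and tokenizes them with the HuggingFace `datasets` and `transformers` libraries; the library choice, the `roberta-base` checkpoint, and the `max_length` value are assumptions made for illustration.

```python
# Hypothetical data-loading sketch: fetch the four GLUE tasks used in the paper
# (MNLI, QQP, QNLI, SST-2) and tokenize them with a RoBERTa tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# GLUE task name -> the text field(s) each task provides.
task_fields = {
    "mnli": ("premise", "hypothesis"),
    "qqp": ("question1", "question2"),
    "qnli": ("question", "sentence"),
    "sst2": ("sentence", None),
}

def tokenize_task(task, max_length=128):
    """Load one GLUE task and tokenize its sentence pair (or single sentence)."""
    raw = load_dataset("glue", task)
    field1, field2 = task_fields[task]

    def encode(batch):
        texts = (batch[field1],) if field2 is None else (batch[field1], batch[field2])
        return tokenizer(*texts, truncation=True, padding="max_length", max_length=max_length)

    return raw.map(encode, batched=True)

sst2 = tokenize_task("sst2")
print(sst2["train"].num_rows)  # roughly 67k training examples
```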
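The paper performs privacy accounting with the PRV accountant of Gopi et al. (2021). The sketch below is a rough stand-in that uses Opacus's RDP accountant instead (which gives a looser bound than the PRV accountant) to show how the quoted noise multiplier, δ, batch size, and epoch count for SST-2 translate into an ε estimate; the SST-2 training-set size and the derived step count are assumptions, not values stated in the quote.

```python
# Hypothetical accounting sketch for the SST-2 configuration quoted above:
# noise multiplier 0.92, delta 1e-5, batch size 2000, 20 epochs.
from opacus.accountants import RDPAccountant

dataset_size = 67_349          # SST-2 training examples (assumption: standard GLUE split)
batch_size = 2_000
epochs = 20

sample_rate = batch_size / dataset_size
steps = int(epochs / sample_rate)   # ~ epochs * (dataset_size / batch_size)

accountant = RDPAccountant()
for _ in range(steps):
    accountant.step(noise_multiplier=0.92, sample_rate=sample_rate)

epsilon = accountant.get_epsilon(delta=1e-5)
print(f"approx. epsilon after {steps} steps: {epsilon:.2f}")
```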
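Finally, a minimal training sketch of the quoted experiment setup, again using Opacus as a stand-in for the paper's own code. It fine-tunes all RoBERTa parameters rather than the paper's parameter-efficient methods (e.g., low-rank adaptation), and the synthetic data, sequence length, chosen learning rate, and physical batch size are illustrative assumptions. Gradient accumulation is used to reach the logical batch size of 2000, mirroring the speed measurement described in the Hardware Specification row.

```python
# Hypothetical training sketch with the quoted hyperparameters: AdamW, weight decay 1e-2,
# clipping threshold 10, noise multiplier 0.92 (SST-2), logical batch size 2000, 20 epochs.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine
from opacus.utils.batch_memory_manager import BatchMemoryManager
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
# lr=1e-3 is one of the four searched learning rates quoted above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Synthetic stand-in for tokenized SST-2; a real run would use the loader sketched above.
input_ids = torch.randint(0, 50_000, (4_000, 64))
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 2, (4_000,))
train_loader = DataLoader(TensorDataset(input_ids, attention_mask, labels), batch_size=2_000)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=0.92,   # SST-2 value from the quoted setup
    max_grad_norm=10.0,      # per-example clipping threshold
)

for epoch in range(20):
    # Gradient accumulation: small physical batches are accumulated until the
    # logical batch of 2000 examples is reached, as in the paper's speed measurement.
    with BatchMemoryManager(
        data_loader=train_loader, max_physical_batch_size=16, optimizer=optimizer
    ) as accumulating_loader:
        for ids, mask, y in accumulating_loader:
            optimizer.zero_grad()
            loss = model(input_ids=ids, attention_mask=mask, labels=y).loss
            loss.backward()
            optimizer.step()
```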