Sparsity-Preserving Differentially Private Training of Large Embedding Models
Authors: Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Amer Sinha, Chiyuan Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we evaluate the performance of our sparsity-preserving training algorithms and compare them against vanilla DP-SGD on both the recommendation and language understanding tasks (Section 4.1). |
| Researcher Affiliation | Collaboration | Badih Ghazi (Google Research, Mountain View, CA; badihghazi@gmail.com); Yangsibo Huang (Princeton University, Princeton, NJ; yangsibo@princeton.edu) |
| Pseudocode | Yes | Algorithm 1: DP-AdaFEST: Adaptive Filtering-Enabled Sparse Training. (A minimal code sketch of this algorithm appears after the table.) |
| Open Source Code | No | The paper references 'Google's DP library' and links to its GitHub repository for privacy accounting, but it does not state that code for the methodology described in this paper is open source or publicly available. |
| Open Datasets | Yes | We evaluate our algorithms on the widely-used Criteo predicted click-through rate (pCTR) dataset (https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset), which includes over four billion ad impressions over 24 days. ... For language understanding tasks, we employ language models from the BERT family [DCLT19]... We fine-tune the RoBERTa model for downstream classification tasks from the GLUE benchmark [WSM+19], including SST-2 [SPW+13], QNLI [RZLL16], and QQP [IYW+17]. |
| Dataset Splits | Yes | To simulate a real-world online training scenario, we train the model on the first 18 days of data and evaluate it on subsequent days (i.e., days 19–24). (A sketch of this split appears after the table.) |
| Hardware Specification | No | The paper mentions 'Google TPUs' as a motivation for sparsity but does not provide specific hardware details (like exact GPU/CPU models, processor types, or TPU versions) used for running its main experiments or simulations. |
| Software Dependencies | No | The paper mentions software like 'Tensorflow' and 'tf.keras.layers.Embedding' but does not provide specific version numbers for these or any other ancillary software components used in the experiments. |
| Experiment Setup | Yes | For DP-SGD, we fine-tune the clipping norm and report the best accuracy achieved. When evaluating DP-FEST, we adjust the hyper-parameter k, which represents the number of preserved top buckets, with values ranging from 100 to 300,000. Regarding DP-AdaFEST, we tune the following hyper-parameters: the ratio of the noise added to the contribution map to that added to the sparse gradient, σ1/σ2, with values ranging from 0.1 to 10; the thresholding value τ ∈ {0.5, 1.0, 5.0, 10.0, 20.0, 50.0, 100.0}; and the clipping norm for the gradient contribution, C1 ∈ {1.0, 5.0, 10.0}. |
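
To make the Pseudocode row concrete, below is a minimal NumPy sketch of the adaptive filtering idea behind DP-AdaFEST, assuming per-example embedding-table gradients are materialized as dense `(batch, vocab, dim)` arrays (real implementations keep them sparse). The function `dp_adafest_step` and every parameter name are illustrative, not from the paper's code, and the paper's exact clipping and noise calibration may differ.

```python
import numpy as np

def dp_adafest_step(per_example_grads, C, C1, sigma1, sigma2, tau, rng):
    """Sketch of one DP-AdaFEST update for a single embedding table.

    per_example_grads: (batch, vocab, dim); each example touches few buckets.
    C      : clipping norm for per-example gradients.
    C1     : clipping norm for per-example bucket-contribution vectors.
    sigma1 : noise multiplier for the contribution map.
    sigma2 : noise multiplier for the gradient itself.
    tau    : threshold for selecting buckets from the noisy contribution map.
    """
    batch, vocab, dim = per_example_grads.shape

    # 1) Per-example contribution map: indicator of touched buckets,
    #    clipped to L2 norm C1 so each example's influence is bounded.
    contrib = (np.abs(per_example_grads).sum(axis=2) > 0).astype(float)
    norms = np.linalg.norm(contrib, axis=1, keepdims=True)
    contrib *= np.minimum(1.0, C1 / np.maximum(norms, 1e-12))

    # 2) Noisy contribution counts; keep only buckets above the threshold.
    noisy_counts = contrib.sum(axis=0) + rng.normal(0.0, sigma1 * C1, size=vocab)
    selected = noisy_counts > tau

    # 3) Standard DP-SGD clip-and-sum on the gradients themselves.
    gnorms = np.linalg.norm(per_example_grads.reshape(batch, -1), axis=1)
    scale = np.minimum(1.0, C / np.maximum(gnorms, 1e-12))
    summed = (per_example_grads * scale[:, None, None]).sum(axis=0)

    # 4) Add Gaussian noise only on the selected buckets; all other rows
    #    stay exactly zero, which is what preserves gradient sparsity.
    noisy = np.zeros_like(summed)
    k = int(selected.sum())
    noisy[selected] = summed[selected] + rng.normal(0.0, sigma2 * C, size=(k, dim))
    return noisy / batch, selected

# Toy usage: two of four examples touch bucket 7.
rng = np.random.default_rng(0)
g = np.zeros((4, 1000, 16))
g[0, 7], g[1, 7] = 1.0, -0.5
update, mask = dp_adafest_step(g, C=1.0, C1=1.0, sigma1=0.5, sigma2=0.5, tau=0.5, rng=rng)
```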
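
The day-based split quoted in the Dataset Splits row reduces to a simple filter. A sketch assuming the Criteo logs sit in a pandas DataFrame with a hypothetical `day` column taking values 1 through 24 (the column name is illustrative):

```python
import pandas as pd

def split_by_day(logs: pd.DataFrame):
    """Online-training split: train on the first 18 days, then evaluate
    separately on each of the held-out days 19-24."""
    train = logs[logs["day"] <= 18]
    eval_by_day = {d: logs[logs["day"] == d] for d in range(19, 25)}
    return train, eval_by_day
```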
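
Finally, the Experiment Setup row translates into a sweep specification like the one below. Only the endpoints of the k and σ1/σ2 ranges are stated in the quoted text, so those are kept as ranges rather than invented grid points; all key names are illustrative, not from the paper's code.

```python
# Hypothetical sweep spec mirroring the quoted hyper-parameter ranges.
sweep = {
    "dp_sgd": {
        "clipping_norm": "tuned per task; best accuracy reported",
    },
    "dp_fest": {
        # only the endpoints are stated in the paper
        "k_top_buckets": {"min": 100, "max": 300_000},
    },
    "dp_adafest": {
        "sigma1_over_sigma2": {"min": 0.1, "max": 10.0},  # noise ratio
        "tau": [0.5, 1.0, 5.0, 10.0, 20.0, 50.0, 100.0],  # threshold
        "C1": [1.0, 5.0, 10.0],                           # contribution clip norm
    },
}
```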